Foreword




This book was prepared to serve as a publication and dissemination platform
for the technical aspects of the Final Report of the COST Action IC1304 “Autonomous
Control for a Reliable Internet of Services (ACROSS)”, which ran for four years, from
Fall 2013 until Fall 2017. COST (European Cooperation in Science and Technology) is
an EU funding agency for research and innovation networks that enables researchers to 
set up their interdisciplinary networks in Europe and beyond. In particular, the main 
goal of the COST Action ACROSS was to create a European network of experts, 
aiming at the development of monitoring and autonomous control methods for a 
reliable and quality-aware future Internet of Services (IoS). As usual for COST 
Actions, the collaboration within ACROSS proceeded on the basis of a Memorandum 
of Understanding (MoU) setting out its main objectives and technical scope. 

The relevance of the IoS paradigm has been emphasized by the rapid developments 
regarding network softwarization (SDN, NFV, dew-, fog-, edge-, and cloud computing, 
etc.) in the course of the Action. This has raised many new research challenges and the 
need for the development of new methods to ensure the reliability of services offered 
via the IoS. 

ACROSS has attracted many researchers. It has consistently grown over the years 
and has evolved into a powerful eco-system that consists of over 100 international 
experts from 31 European countries, where both academia and industry are well 
represented. 

To support the realization of the Action’s main goals, we have organized 
semi-annual Management Committee (MC) meetings and co-located technical meetings,
open international workshops on dedicated research topics within the Action’s
scope, and international Summer Schools for training of PhD students and other 
early-stage researchers (ESR) in the field. In addition, ACROSS has also funded many 
so-called short-term scientific missions (STSM) to enable short international research 
visits. 

This book contains chapters written by various groups of co-authors that cover a 
broad range of research challenges and topics addressed by them during the course 
of the Action. We emphasize that the range of topics is based on the preferences and
research interests of the members of these different groups. 

The Action has been successful in establishing many new Pan-European research 
collaborations, and has boosted the careers of a large number of participants. This book
is the product of a fruitful informal collaboration, and we hope that it will be received 
in the same spirit that motivated its co-authors. 


March 2018 Ivan Ganchev 
R. D. van der Mei 
Hans van den Berg 


Preface 


The explosive growth of the Internet has fundamentally changed global society. The 
emergence of concepts like service-oriented architecture (SOA), software as a service 
(SaaS), platform as a service (PaaS), infrastructure as a service (IaaS), network as a 
service (NaaS), and cloud computing in general has catalyzed the migration from the 
information-oriented Internet to an Internet of Services (IoS). This has opened up 
virtually unbounded possibilities for the creation of new and innovative services that 
facilitate business processes and improve the quality of life. However, this also calls for 
new approaches to ensure quality and reliability of these services. To overcome current 
shortcomings, a huge number of research challenges have to be addressed in this area, 
ranging from the initial conceptualization and modelling, to the elaboration of suitable 
approaches, techniques, and algorithms, and to the development of suitable tools and 
the elaboration of realistic use-case scenarios that also take into account the
corresponding societal and economic aspects.

The objective of this book is, by applying a systematic approach, to assess the state 
of the art and consolidate the main research results achieved in this area. It was 
prepared as a final publication of the COST Action IC1304 “Autonomous Control for a 
Reliable Internet of Services (ACROSS).” The book contains 14 chapters and is a 
showcase of the main outcomes of the Action in line with its scientific goals. The book 
can serve as a valuable reference for undergraduate students, postgraduate students, 
educators, faculty members, researchers, engineers, and research strategists working in 
this field. 

The book chapters were collected through an open, but selective, three-stage 
submission/review process. An open call for contributions was distributed among the 
COST ACROSS community in October 2016. To ensure the quality of the book,
reduce overlap, and increase the level of synergy between different research groups
working on similar problems, the leaders of the Task Forces established within
ACROSS were asked to coordinate and consolidate the initial chapter proposals. As a
result, a total of 17 extended abstracts were received in response to the call. These were 
reviewed by the book editors and their authors were invited to the next stage of 
full-chapter submission. At the end of this stage, 15 full-chapter proposals were
received by the set deadline. All submitted chapters were peer-reviewed by independent
reviewers (including reviewers outside the COST Action ACROSS), appointed by
the book editors, and after the first round of reviews 14 chapters remained. These were
duly revised according to the reviewers’ comments, suggestions, and notes, and were
finally accepted for publication in this book.

The first chapter entitled “State of the Art and Research Challenges in the Area of 
Autonomous Control for a Reliable Internet of Services” serves as an introduction to 
this book. To this end, it first analyzes the state of the art in the area of autonomous
control for a reliable IoS and then identifies the main research challenges within it. A
general background and high-level description of the current state of knowledge are
presented. Then, for each of the three subareas (autonomous management and real-time
control, methods and tools for monitoring and service prediction, and smart pricing and
competition in multi-domain systems), a brief general introduction and background are
presented, and a list of key research challenges is formulated.

The second chapter, “Context Monitoring for Improved System Performance and
QoE,” is focused on the potential of enhancing the quality of experience (QoE)
management mechanisms by exploiting valuable context information. First, a general
framework for context monitoring is discussed along with the context information,
including technical, usage, social, economic, temporal, and physical factors. Then
opportunities, challenges, and benefits of including context in the QoE monitoring and
management are considered. The benefits are demonstrated through use cases involving
video flash crowds, and online and cloud gaming. Finally, potential technical
realizations of context-aware QoE monitoring and management, based on the
software-defined networking (SDN) paradigm, are discussed.

The concept of QoE management is also treated in the next chapter “QoE
Management for Future Networks,” which provides an introduction to this concept by
discussing its origins and key terms, and gives an overview of the most relevant
existing theoretical frameworks. Promising technical approaches to QoE-driven
management, provided across different layers of the networking stack, are also discussed
along with an outlook on the future of QoE management, with a focus on the key
enablers that are essential for ultimately turning QoE-aware network and
application management into reality.

In the same vein, the chapter “Scalable Traffic Quality and System Efficiency
Indicators Towards Overall Telecommunication System’s QoE Management” delves into 
the conceptual and analytical models of overall telecommunication systems, and the 
definition of scalable indicators on each system level for QoS monitoring and prediction, 
and toward QoE management. Two network cost/quality integral criteria are proposed — 
mean and instantaneous — along with illustrative numerical predictions of the latter, which 
could be used for dynamic execution of pricing policies, depending on the network load. 

The next chapter “Lag Compensation for First-Person Shooter Games in Cloud 
Gaming” continues by exploring the impact of latency, known as lag, on QoE for 
so-called first-person shooter cloud games. The authors first describe their approach
to lag compensation, based on real-time equalization (within reason) of the uplink and
downlink delays for all game players. Secondly, they describe the testbed (the
open-source GamingAnywhere platform), the use of the network time protocol
(NTP) to synchronize time, the network emulator, and the role of the centralized log
server. Finally, the authors present results validating their approach, along with
small-scale and preliminary subjective tests for assessing its performance, and conclude
the chapter by outlining ongoing and future work.

This is followed by the chapter entitled “The Value of Context-Awareness in 
Bandwidth-Challenging HTTP Adaptive Streaming Scenarios,” which analyzes an 
adaptive streaming technology, based on the hypertext transfer protocol (HTTP), which 
adapts the video reproduction to the currently prevailing network conditions. In
particular, the authors study how context awareness can be combined with the adaptive
streaming logic to design a proactive context-aware client-based video streaming strategy,
showing promising results for successful mitigation of video stalling due to network
connectivity problems. The authors analyze the performance of this strategy by comparing
it with the optimal case, as well as by considering situations where context
awareness lacks reliability.

The next chapter, entitled “Conceptual and Analytical Models for Predicting the
Quality of Service of Overall Telecommunication Systems,” presents scalable conceptual
and analytical performance models of overall telecommunication systems,
allowing the prediction of multiple quality of service (QoS) indicators as functions
of the user and network behavior. The authors consider two conceptual model presentation
structures along with an analytical method for conversion between them, and
propose corresponding additive and multiplicative metrics for practical use. An analytical
model, allowing the prediction of flow, time, and traffic characteristics of the
overall network performance, is elaborated. Differentiated QoS indicators, as well as
analytical expressions for their prediction, are proposed. The results demonstrate the
ability of the proposed model to facilitate a more precise dynamic QoS management as
well as to predict some QoE indicators.

The chapter “QoS-Based Elasticity for Service Chains in Distributed Edge Cloud
Environments” is focused on elasticity as a dominant system engineering attribute for
providing QoS-aware services to users by the emerging Internet of Things (IoT) and
cloud-based networked systems relying heavily on virtualization technologies. Even
though the concept of elasticity can introduce significant QoS and cost benefits, in
distributed systems with several layers of abstraction, controlling the elasticity in a
centralized manner could strongly penalize scalability. To address this problem, the
authors propose an approach of splitting the system into autonomous subsystems, which
implement elasticity mechanisms and run control policies in a decentralized manner,
and coordinate elasticity decisions that collectively improve the overall system
performance. The authors’ focus is on design choices that may affect the elasticity
properties. To this end, an overview of some decentralized design patterns, related to the
coordination of elasticity decisions, is provided as well.

The next chapter “Integrating SDN and NFV with QoS-aware Service Composition” 
provides an overview of QoS-aware strategies that can be used at the network 
abstraction levels aiming to fully exploit the new network opportunities of full
integration of heterogeneous hardware and software functions, configured at runtime, with a
minimal time-to-market cycle, provided to end-users on an “as a service” basis. More
specifically, the authors present three use cases of integrating SDN and network function 
virtualization (NFV) technologies with QoS-aware service composition, ranging from 
the energy-efficient placement of virtual network functions inside modern data centers, 
to the deployment of data stream processing applications using SDN to control the 
network paths, and to exploiting SDN for context-aware service compositions. 

By stating that energy awareness and capability to deliver multimedia content with 
different possible combinations of quality and cost require complex optimization 
frameworks, the chapter “Energy vs. QoX Network- and Cloud Services Management” 
emphasizes that it is necessary to define more flexible paradigms by taking into account 
other design parameters, such as energy, and by considering these as tuneable variables 
playing a vital role in the adaptation mechanisms. The authors briefly introduce the most
commonly used frameworks for multi-criteria optimization and evaluate these under
different “energy vs. quality of anything (QoX)” sample scenarios. Finally, the current
status of related network management tools is described in order to identify possible
application areas.

The next chapter “Traffic Management for Cloud Federation” provides a survey on 
architectures for cloud federation and describes corresponding standardization activities,
before proposing a comprehensive five-level model for traffic management for
cloud federations, providing specific methods and algorithms at each level. The 
effectiveness of the proposed solutions is verified by using simulation and analytical 
methods. A specialized simulator for testing cloud-federation solutions within an IoT 
environment is described at the end of the chapter. 

By arguing that most of the distributed systems simulators are either too detailed or 
not extensible enough to support the modelled IoT devices, and hence problematic to 
apply in the newly emerging IoT domain, the chapter “Efficient Simulation of IoT 
Cloud Use Cases” shows how generic IoT sensors could be modelled in a
state-of-the-art simulator using a derived generalized IoT use case. A validation of the
applicability of the introduced IoT extension with fitness and meteorological use cases 
completes the chapter. 

Considering the IoT as one of the main building blocks of the future IoS, the next 
chapter “Security of Internet of Things for a Reliable Internet of Services” shifts the
focus to the security of the IoT, which could contribute to achieving a highly
reliable IoS by autonomously preventing, detecting, or mitigating attacks against it. The
authors review the characteristics of IoT environments, cryptography-based security
mechanisms, and (distributed) denial of service (D/DoS) attacks targeting IoT networks.
Moreover, they extensively analyze the intrusion detection and mitigation mechanisms 
proposed for IoT and evaluate these from various points of view. Open research issues 
for more reliable and available IoT and IoS are discussed at the end of the chapter. 

The final chapter “TCP Performance over Current Cellular Access: A Comprehensive
Analysis” moves from the area of services into the area of underlying communication
protocols. More specifically, it treats unresolved questions and problems
regarding the interaction between the transmission control protocol (TCP) and mobile
broadband technologies such as long-term evolution (LTE). To this end, the chapter
collects the behavior of distinct TCP implementations (both loss-based and delay-based)
under various network conditions in different LTE deployments and compares
them in terms of the achieved throughput and utilization of radio resources.

The book editors wish to thank all reviewers for their excellent and rigorous 
reviewing work, as well as their responsiveness during the critical stages to consolidate 
the contributions provided by the authors. We are most grateful to all authors who have 
entrusted their excellent work, the fruits of many years’ research in each case, to us and 
for their patience and continued demanding revision work in response to reviewers’ 
feedback. We also thank them for adjusting their chapters to the specific book template 
and style requirements, completing all the bureaucratic but necessary paperwork, and 
meeting all the publishing deadlines. 

Abstract. The explosive growth of the Internet has fundamentally changed the
global society. The emergence of concepts like service-oriented architecture
(SOA), Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure
as a Service (IaaS), Network as a Service (NaaS) and Cloud Computing in
general has catalyzed the migration from the information-oriented Internet into
an Internet of Services (IoS). This has opened up virtually unbounded possibilities
for the creation of new and innovative services that facilitate business
processes and improve the quality of life. However, this also calls for new
approaches to ensuring quality and reliability of these services. The goal of this
book chapter is to first analyze the state-of-the-art in the area of autonomous
control for a reliable IoS and then to identify the main research challenges within
it. A general background and high-level description of the current state of
knowledge are presented. Then, for each of the three subareas, namely
autonomous management and real-time control, methods and tools for monitoring
and service prediction, and smart pricing and competition in
multi-domain systems, a brief general introduction and background are presented,
and a list of key research challenges is formulated.


Keywords: Internet of Services (IoS) - Autonomous control -
Autonomous management - Service monitoring - Service prediction -
Smart pricing


1 Introduction 


Today, we are witnessing a paradigm shift from the traditional information-oriented 
Internet into an Internet of Services (IoS). This transition opens up virtually unbounded 
possibilities for creating and deploying new services. Eventually, the Information and 
Communication Technologies (ICT) landscape will migrate into a global system where 
new services are essentially large-scale service chains, combining and integrating the 
functionality of (possibly huge) numbers of other services offered by third parties, 
including cloud services. At the same time, as our modern society is becoming more 
and more dependent on ICT, these developments raise the need for effective means to 
ensure quality and reliability of the services running in such a complex environment. 
Motivated by this, the EU COST Action IC1304 “Autonomous Control for a 
Reliable Internet of Services (ACROSS)” has been established to create a European 
network of experts, from both academia and industry, aiming at the development of 
autonomous control methods and algorithms for a reliable and quality-aware IoS. 
The goal of this chapter is to identify the main scientific challenges faced during the 
course of the COST Action ACROSS. To this end, a general background and a 
high-level description of the current state of knowledge are first provided. Then, for
each of the Action’s three working groups (WGs), a brief introduction and background
information are provided, followed by a list of key research topics pursued during the
Action’s lifetime, with a short description of each.


2 General Background and Current State of Knowledge 


The explosive growth of the Internet has fundamentally changed the global society. 
The emergence of concepts like service-oriented architecture (SOA), Software as a 
Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS) and 
Cloud Computing has catalyzed the migration from the information-oriented Internet 
into an IoS. Together with the Network as a Service (NaaS) concept, enabled through 
emerging network softwarization techniques (like SDN and NFV), this has opened up 
virtually unbounded possibilities for the creation of new and innovative services that 
facilitate business processes and improve the quality of life. As a consequence, modern 
societies and economies are and will become even more heavily dependent on ICT. 
Failures and outages of ICT-based services (e.g., financial transactions, Web-shopping, 
governmental services, generation and distribution of sustainable energy) may cause 
economic damage and affect people’s trust in ICT. Therefore, providing reliable and 
robust ICT services (resistant against system failures, cyber-attacks, high-load and 
overload situations, flash crowds, etc.) is crucial for our economy at large. Moreover, in 
the competitive markets of ICT service offerings, it is of great importance for service 
providers to be able to realize short time-to-market and to deliver services at sharp 
price-quality ratios. These observations make the societal and economic importance of 
reliable Internet services evident. 

A fundamental characteristic of the IoS is that services combine and integrate 
functionalities of other services. This has led to complex service chains with possibly 
even hundreds of services offered by different third parties, each with their own business 
incentives. In current practice, service quality of composite services is usually controlled 
on an ad-hoc basis, while the consequences of failures in service chains are not well 
understood. The problem is that, although such an approach might work for small
service chains, it will become unworkable for future complex global-scale service chains.

Over the past few years, significant research has been devoted to controlling 
Quality of Service (QoS) and Quality of Experience (QoE) for IoS. To this end, much 
progress has been made at the functional layer of QoS-architectures and frameworks, 
and system development for the IoS. However, relatively little attention has been paid 
to the development, evaluation and optimization of algorithms for autonomous control 
that can deal with the growing scale and complexity of the involved service chains. In 
this context, the main goal of the COST Action ACROSS was to bring the 
state-of-the-art on autonomous control to the next level by developing quantitative 
methods and algorithms for autonomous control for a reliable IoS. 

In the area of quantitative control methods the main focus has been on ‘traditional’ 
controls for QoS provisioning at the network layer and lower layers. In this context, it 
is important to note that control methods for the IoS also operate at the higher protocol 
layers and typically involve a multitude of administrative domains. As such, these 
control methods, and their effectiveness, are fundamentally different from the traditional
control methods, posing fundamentally new challenges. For example, for
composite service chains the main challenges are methods for dynamic re-composition, 
to prevent or mitigate the propagation of failures through the service chains, and 
methods for overload control at the service level. 

Another challenging factor in quality provisioning in the IoS is its highly dynamic 
nature, imposing a high degree of uncertainty in many respects (e.g., in terms of 
number and diversity of the service offerings, the system load of services suddenly 
jumping to temporary overload, demand for cloud resources, etc.). This raises the 
urgent need for online control methods with self-learning capabilities that quickly adapt 
to — or even anticipate — changing circumstances [9]. 

The COST Action ACROSS has brought the state-of-the-art in the area of autonomous
quality-based control in the IoS to the next level by developing efficient
methods and algorithms that enable network and service providers to fully exploit the
enormous possibilities of the IoS. This required conducting research in the following
important sub-areas:


1. Autonomous management and real-time control; 
2. Methods and tools for monitoring and service prediction; 
3. Smart pricing and competition in multi-domain systems. 


These sub-areas were respectively covered by the three ACROSS working groups:
WG1, WG2 and WG3. In the following sections, scientific challenges faced in the
context of each of these three working groups are elaborated. 


3 Autonomous Management and Real-Time Control 


On a fundamental level, the working group WG1, associated with this research 
sub-area, was primarily concerned with the management and control of networks, 
services, applications, and compositions of services or applications. Of particular 
interest were management and control techniques that span multiple levels, e.g., the 
network and service level. 


3.1 Introduction and Background 


To deliver reliable services in the IoS, service providers need to implement control 
mechanisms, ranging from simplistic to highly advanced. Typical questions are the 
following: 


• How can one realize the efficient use of control methods by properly setting
  parameter values and decision thresholds?
• How can one effectively use these mechanisms depending on the specific context of
  a user (e.g., in terms of the user’s location, the user’s role, operational settings or
  experienced quality)?
• How do control methods implemented by multiple providers interact?
• How does the interaction between multiple control methods affect their
  effectiveness?
• What about stability?
• How to resolve conflicts?


Ideally, control mechanisms would be fully distributed and based on (experienced) 
quality. However, some level of centralized coordination among different autonomous 
control mechanisms may be needed. In this context, a major challenge is to achieve a 
proper trade-off between fully distributed control (having higher flexibility and 
robustness/resilience) and more centralized control (leading to better performance 
under ‘normal’ conditions). This will lead to hybrid approaches, aiming to combine 
‘the best of two worlds’. 


3.2 Control Issues in Emerging Softwarized Networks 


As part of the current cloud computing trend, the concept of cloud networking [63] has 
emerged. Cloud networking complements the cloud computing concept by enabling 
and executing network features and functions in a cloud computing environment. The
addition of computing capabilities to networks elegantly captures the notion of
“softwarization of networks”. The added computing capabilities are typically general
processing resources, e.g., off-the-shelf servers, which can be used for satisfying
computing requirements at the application layer (e.g., for the re-coding of videos) or
at the network layer (e.g., for the computation of routes). Hence, features and functions
in the network-oriented layers are moved away from hardware implementations into
software where appropriate, which has lately been termed network function
virtualization (NFV) [24].

The Software-Defined Networking (SDN) paradigm [31] emerged as a solution to 
the limitations of the monolithic architecture of conventional network devices. By 
decoupling the system that makes decisions about where traffic is sent (the control 
plane) from the underlying systems that forward traffic to the selected destination (the 
data plane), SDN allows network administrators to manage network services through 
the abstraction of lower-level and more fine-grained functionality. Hence, SDN and
the softwarization of networks (NFV) stand for a “new and fine-grained split of network
functions and their location of execution”. Issues related to the distribution and
coordination of software-based network functionality controlling the new simplified
hardware (or virtualized) network devices formed a major research issue within
ACROSS.
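
The control/data plane split described above can be illustrated with a minimal
Python sketch. It is not taken from any SDN controller or from the ACROSS work: the
class names Controller and Switch, the trace helper, and the five-switch topology are
all hypothetical. The controller holds the topology and computes shortest-hop routes,
while each switch only looks up a next hop in the table the controller installed.

```python
from collections import deque


class Switch:
    """Data plane (hypothetical): forwards by table lookup only."""

    def __init__(self, name):
        self.name = name
        self.flow_table = {}  # destination -> next hop

    def forward(self, destination):
        return self.flow_table.get(destination)


class Controller:
    """Control plane (hypothetical): knows the topology, decides the routes."""

    def __init__(self, links):
        self.neighbors = {}
        for a, b in links:
            self.neighbors.setdefault(a, []).append(b)
            self.neighbors.setdefault(b, []).append(a)
        self.switches = {name: Switch(name) for name in self.neighbors}

    def install_routes(self, destination):
        # Breadth-first search outward from the destination: the node from
        # which a switch is first reached is its next hop on a shortest path.
        visited = {destination}
        queue = deque([destination])
        while queue:
            node = queue.popleft()
            for nb in self.neighbors[node]:
                if nb not in visited:
                    visited.add(nb)
                    self.switches[nb].flow_table[destination] = node
                    queue.append(nb)


def trace(controller, source, destination):
    """Follow flow-table entries hop by hop; None if a switch has no entry."""
    path = [source]
    while path[-1] != destination:
        nxt = controller.switches[path[-1]].forward(destination)
        if nxt is None:
            return None
        path.append(nxt)
    return path


# Made-up topology: two paths from s1 to s3, one of them shorter.
links = [("s1", "s2"), ("s2", "s3"), ("s1", "s4"), ("s4", "s5"), ("s5", "s3")]
controller = Controller(links)
controller.install_routes("s3")
print(trace(controller, "s1", "s3"))  # ['s1', 's2', 's3']
```

Changing the routing policy in this sketch only requires changing install_routes;
the Switch class stays untouched, which is the point of the decoupling.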


3.3 Scalable QoS-Aware Service Composition Using Hybrid 
Optimization Methods 


Automated or semi-automated QoS-aware service composition is one of the most 
prevalent research areas in the services research community [25, 56, 85]. In QoS-aware 
composition, a service composition (or business process, or scientific workflow) is 
considered as an abstract graph of activities that need to be executed. Concrete services 
can be used to implement specific activities in the graph. Typically, it is assumed that 
there are multiple functionally identical services with differing QoS available to
implement each activity in the abstract composition. The instantiation problem is then 
to find the combination of services to use for each activity so that the overall QoS 
(based on one or more QoS metrics) is optimal, for instance, to minimize the QoS 
metric “response time” given a specific budget. The instantiation problem can be 
reduced to a minimization problem, and it is known to be NP-complete. 
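
To make the instantiation problem concrete, the following Python sketch solves a tiny
instance by exhaustive search. It assumes a purely sequential composition, so that
response times and costs simply add up; the activity names, service names, and all
numbers are invented for illustration. Exhaustive search is only viable for very
small instances, since the number of combinations grows exponentially with the
number of activities.

```python
from itertools import product

# Hypothetical candidates: for each abstract activity, a list of
# (service name, response time in ms, cost) tuples. All values are made up.
CANDIDATES = {
    "authenticate": [("auth-a", 40, 3), ("auth-b", 25, 5)],
    "lookup": [("db-a", 120, 2), ("db-b", 80, 6), ("db-c", 60, 9)],
    "render": [("web-a", 70, 1), ("web-b", 50, 4)],
}


def instantiate(candidates, budget):
    """Return (total_time, total_cost, chosen names) for the fastest
    sequential composition whose total cost stays within the budget,
    or None if no combination is affordable."""
    best = None
    # Exhaustive search: one candidate per activity, all combinations.
    for combo in product(*candidates.values()):
        total_time = sum(service[1] for service in combo)
        total_cost = sum(service[2] for service in combo)
        if total_cost > budget:
            continue
        if best is None or total_time < best[0]:
            best = (total_time, total_cost, tuple(s[0] for s in combo))
    return best


print(instantiate(CANDIDATES, budget=12))
# (175, 12, ('auth-b', 'db-b', 'web-a'))
```

With three activities there are only 2 x 3 x 2 = 12 combinations here; the heuristics
mentioned in the text become relevant once this product is too large to enumerate.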

Traditionally, QoS-aware service composition has been done using deterministic 
methods (e.g., simplex) for small service compositions, or a wide array of heuristics for 
large-scale problem instances (e.g., genetic algorithms, simulated annealing, and various
custom implementations). However, the advent of cloud services and SDNs,
service brokers, as well as the generally increasing size of service compositions require
new hybrid methods, which combine locally optimal solutions on various levels (e.g.,
the network, application, or service broker level). It is as yet unclear how such
optimizations on various levels, conducted by various separate entities, can be optimally
performed and coordinated, and how stability of such systems can be ensured. However,
one promising approach is the utilization of nature-inspired composition techniques,
for instance, the chemical programming metaphor [25, 58].


3.4 Efficient Use of Cloud Federation and Cloud Bursting Concepts 


One of the challenges of current cloud computing systems is the efficient use of 
multiple cloud services or cloud providers. On the network level, this includes the idea 
of virtual network infrastructures (VNIs), cf. [44]. The VNI concept assumes
exploitation of network resources offered by different network providers and their
composition into a common, coherent communication infrastructure supporting distributed
cloud federation [17]. Controlling, managing, and monitoring network
resources would allow cloud federations to implement various new features that could:
(1) optimize traffic between sites, services, and users; (2) provide isolation for
whole clouds or even for particular users, e.g., those who require deployment of their own
protocols over the network layer; (3) simplify the process of extending and integrating
cloud providers and network providers into a federation with reduced efforts and costs.

On the service and application level, the idea of cloud bursting has been proposed as
a way to efficiently use multiple cloud services [32, 57]. In cloud bursting, applications
or services typically run in a private cloud setup until an external event (e.g., a
significant load spike that cannot be covered by internal resources) forces the application
to “burst” and move either the entire application or parts of it to a public cloud service.
While this model has clear commercial advantages, its concrete realization is still
difficult, as cloud bursting requires intelligent control and management mechanisms for
predicting the load, for deciding which applications or services to burst, and for
technically implementing a seamless migration. Additionally, increased network latency
is often a practical problem in cloud bursting scenarios.
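
The control decisions named above (predicting the load and deciding what to burst) can be
illustrated with a minimal sketch. It assumes a naive moving-average predictor and a greedy
rule that bursts latency-tolerant services first; the function names, the load shares, and
the capacity figures are hypothetical and are not taken from any of the cited systems.

```python
from statistics import mean


def predict_load(history, window=3):
    """Naive predictor (an assumption): mean of the last `window` load samples."""
    return mean(history[-window:])


def choose_services_to_burst(services, predicted_load, private_capacity):
    """Greedily select services to move to a public cloud.

    `services` maps a service name to (load_share, latency_sensitive), where
    load_share is the fraction of the total load caused by that service.
    Latency-tolerant services are burst first, largest load share first,
    until the predicted load fits into the private capacity.
    """
    excess = predicted_load - private_capacity
    burst = []
    # False sorts before True, so latency-tolerant services come first.
    candidates = sorted(services.items(), key=lambda kv: (kv[1][1], -kv[1][0]))
    for name, (share, _sensitive) in candidates:
        if excess <= 0:
            break
        burst.append(name)
        excess -= share * predicted_load
    return burst


# Hypothetical example data.
services = {
    "web": (0.5, True),       # latency-sensitive front end
    "batch": (0.3, False),    # latency-tolerant batch jobs
    "reports": (0.2, False),  # latency-tolerant reporting
}
predicted = predict_load([50, 70, 90, 110])
to_burst = choose_services_to_burst(services, predicted, private_capacity=60)
```

Keeping latency-sensitive services in the private cloud is one simple way to account for the
latency issue mentioned above; a realistic controller would also have to model migration cost.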


3.5 Energy-Aware Network and Service Control 


Traditionally, the optimization of ICT service provision made use of network
performance-related characteristics or key performance indicators (KPIs) as basic inputs for
control and actuation loops. Those initially simple and purely technical parameters
later evolved into more complex QoE-related aspects, leading to multi-variate
optimization problems. New control and actuation loops then had to handle several
parameters jointly because of the trade-offs and interdependencies
among input and output indicators. This is usually done by composing the effects
through a simplified utility function. The resulting approaches have therefore focused
particularly on the reward (in terms of users’ satisfaction) to be achieved by efficiently
using the available network resources (cf. [75]).

Meanwhile, the cost of doing so was mostly treated as a constraint of the
mathematical problem and, again, covered technical resources only. However, the
rise of “green ICT” and, more generally, the requirement of economically
sustainable and profitable service provision entail new research challenges in which the
cost of service provisioning must also account for energy consumption and price (cf.
[74]). The resulting energy- and price-aware control loops demand intensive research,
as the underlying multi-objective optimization problem, the complexity of the
utility functions (cf. [60]), and the mechanisms for articulation of preferences exceed
current common practice.
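
To make the notion of an energy- and price-aware utility function concrete, the following
sketch scalarizes the two objectives with a weighted sum. This is only one of many possible
ways to articulate preferences; the configuration names, QoE scores, energy figures, and
weights are illustrative assumptions.

```python
def utility(qoe, energy_kwh, price_per_kwh, w_qoe=1.0, w_cost=1.0):
    """Weighted-sum utility: reward for QoE minus a penalty for energy cost."""
    return w_qoe * qoe - w_cost * energy_kwh * price_per_kwh


def best_config(configs, price_per_kwh, **weights):
    """Return the name of the configuration with the highest utility."""
    return max(
        configs,
        key=lambda name: utility(
            configs[name]["qoe"], configs[name]["energy_kwh"], price_per_kwh, **weights
        ),
    )


# Hypothetical operating points of a service (QoE on a 1..5 scale).
configs = {
    "low_power": {"qoe": 3.5, "energy_kwh": 1.0},
    "balanced": {"qoe": 4.2, "energy_kwh": 2.0},
    "max_perf": {"qoe": 4.5, "energy_kwh": 4.0},
}

choice_cheap = best_config(configs, price_per_kwh=0.1)
choice_medium = best_config(configs, price_per_kwh=0.5)
choice_expensive = best_config(configs, price_per_kwh=1.0)
```

With these numbers the preferred operating point moves from the highest-performance
configuration to the low-power one as the energy price rises, which is the kind of trade-off
a price-aware control loop has to resolve continuously.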

Such constraints affect not only the network but also the whole ICT service
provision chain. For example, server farms are vital components in cloud computing, and
advanced multi-server queueing models that include features essential for
characterizing scheduling performance as well as energy efficiency need to be developed. Recent
results in this area include [29, 40, 41], which analyze fundamental structural properties of
policies that optimize the performance-energy trade-off. Other
works [20, 67] employ energy-driven Markov Decision Process (MDP)
solutions. In addition, the use of energy-aware multi-path TCP in heterogeneous networks
([15, 21]) has become a challenging research topic.


3.6 Developments in Transport Control Protocols 


Transport protocols, particularly TCP and related protocols, are subject to continuous
evolution for at least two reasons besides the omnipresent, general desire to improve.
The first reason is the need to keep up with the development of Internet infrastructure,
e.g., reduced memory costs, widespread fibre deployment, and high-speed cellular
technologies, which enable larger buffers, higher bit rates, and/or more variable
channels. The second reason is the increasing competition between providers of
Internet-based services, which drives various efforts to keep ahead of the competition in
terms of user experience. The results are new versions of the TCP congestion control
algorithm as well as new protocols to replace TCP.

The work on new TCP congestion control algorithms includes adapting to
the changing characteristics of the Internet, such as the higher and more variable
bandwidths offered by cellular access [1, 11, 34, 50, 52, 80, 84], possibly using
cross-layer approaches [6, 10, 59, 61], but also simple tuning of existing TCP, such as
increasing the size of the initial window [13, 16, 66, 82].
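
The effect of a larger initial window can be illustrated with a small calculation. The
sketch below assumes an idealized slow start (the congestion window doubles every round
trip, with no loss and no receiver window limit) and counts the round trips needed to
deliver an object; the object size of 30 segments is an arbitrary example.

```python
def slow_start_rounds(num_segments, initial_window):
    """Round trips needed to send `num_segments` under idealized slow start.

    Assumptions: the window doubles each round, no losses, no receiver limit.
    """
    sent, cwnd, rounds = 0, initial_window, 0
    while sent < num_segments:
        sent += cwnd   # one window of segments is delivered per round trip
        cwnd *= 2      # idealized exponential growth
        rounds += 1
    return rounds


rounds_iw3 = slow_start_rounds(30, initial_window=3)    # 3 + 6 + 12 + 24 -> 4 rounds
rounds_iw10 = slow_start_rounds(30, initial_window=10)  # 10 + 20 -> 2 rounds
```

Under these assumptions a short transfer completes in half the number of round trips,
which explains the appeal of this tuning and also why it raises the fairness and
congestion concerns discussed later in this section.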

The efforts to replace TCP include QUIC (Quick UDP Internet Connections, a
protocol from Google) and SPUD (Session Protocol for User Datagrams, an IETF
initiative) and its successor PLUS (Path Layer UDP Substrate, also an IETF initiative),
cf. [76]. The QUIC protocol aims at reducing latency by combining connection
establishment (the three-way handshake in TCP) with encryption key exchange (presently a
second transaction); it also includes the possibility of completely eliminating the key
exchange if cached keys are available and can be reused. Another key feature is the
built-in support for HTTP/2 such that multiple objects can be multiplexed over the
same connection [18, 72]. The purpose of SPUD/PLUS is to offer an end-to-end transport
protocol based on UDP with support for direct communication with middleboxes (e.g.,
firewalls). The rationale is that TCP is difficult to evolve because
present middleboxes rely on implicit interpretations of TCP, and/or on the lack of
encryption, to perform various functions, some of which may even be
unwanted. Examples of such implicit interpretations include TCP packets with SYN
and ACK flags being interpreted by gateways as confirmations of NAT (network
address translation) settings and by firewalls as confirmations of user acceptance [23,
49]. Examples of possibly unwanted functionality include traffic management devices
aborting flows by manipulating the RST flag in TCP packets [22].

New versions of TCP or new DIY (do-it-yourself) protocols open a world of threats
and opportunities. The threats range from unfair competition [18, 72] to the risk of
congestion collapse as content providers develop more and more aggressive protocols
and deploy faster and faster access in an attempt to improve their service [13, 66, 82].
They also include the inability to cache popular objects near users or to prioritize
between flows on congested access links, as a result of the tendency to paint all traffic
“grey”, i.e., to encrypt even trivial things like public information (cf. Section 4.6). As
for opportunities, TCP clearly has some performance problems and is a part of the
ossification of the Internet. A (set of) new protocol(s) could circumvent the issues
related to TCP, be adapted to present networks and content, and therefore potentially
provide better performance.

The goal of the work on transport protocols in this context is, primarily, to evaluate
existing transport protocols and, secondarily, to present new congestion control
algorithms and/or new transport protocols that work better than present TCP and at the
same time compete fairly with well-behaved, legacy TCP.


4 Methods and Tools for Monitoring and Service Prediction 


Methods and tools for monitoring and service prediction were the main topic of WG2,
mostly considered in the context of a larger system that needs to be (autonomously)
controlled.


4.1 Introduction and Background 


A crucial element for autonomous control in the IoS is monitoring and service
prediction. For autonomous real-time control of (user-perceived) QoS and QoE in large,
dynamic, complex multi-domain environments like the IoS, there is a great need for scalable,
non-intrusive monitoring and measurement of service demands, service performance,
and resource usage. Additional constraints regarding, for instance, privacy and integrity
further complicate monitoring and measurement. In addition,
proactive service adaptation capabilities are rapidly becoming more important
for service-oriented systems like the IoS. In this context, there is a need for online quality
prediction methods in combination with self-adaptation capabilities (e.g., service
re-composition). Service performance monitoring capabilities are also important for
assessing Service Level Agreement (SLA) conformance and, moreover, for providing
accurate billing information. In general, the metrics to monitor depend on the point of view
adopted. For instance, cloud providers need metrics to monitor SLA conformance and
manage the cloud, whereas composite service providers have to monitor multiple SLAs,
which in turn differs from what needs to be monitored for customers and service
consumers.


4.2 How to Define ‘QoS’ and ‘QoE’, and What to Measure? 


A common definition of QoE is provided in [55]: “QoE is the degree of delight or
annoyance of the user of an application or service. It results from the fulfillment of his
or her expectations with respect to the utility and/or enjoyment of the application or
service in the light of the user’s personality and current state.” In contrast, ITU-T
Rec. P.10 defines QoE as “the overall acceptability of an application or service, as
perceived subjectively by the end user”. The definition in [55] advances the ITU-T
definition by going beyond merely binary acceptability and by emphasizing the
importance of both pragmatic (utility) and hedonic (enjoyment) aspects of quality
judgment formation. The difference from the definition of QoS in ITU-T Rec. E.800 is
significant: “[the] totality of characteristics of a telecommunications service that bear
on its ability to satisfy stated and implied needs of the user of the service”. Factors
important for QoE, like context of usage and user characteristics, are not comprehensively
addressed by QoS.

As a common denominator, four categories of QoE influence factors
[37, 55] are distinguished: influence factors on the context, user, system,
and content level (Fig. 1). The context level considers aspects like the environment
in which the user is consuming the service, the social and cultural background, or the
purpose of using the service, such as killing time or information retrieval. The user level
includes psychological factors like the expectations of the user, memory and recency
effects, or the usage history of the application. The technical influence factors are
abstracted on the system level. They cover influences of the transmission network, the
devices and screens, but also of the implementation of the application itself, such as video
buffering strategies. The content level addresses, taking video delivery as an example,
the video codec, format, and resolution, but also the duration, contents, and type
of the video and its motion patterns.


Fig. 1. Different categories of QoE influence factors


4.3 QoE and QoS Monitoring for Cloud Services 


The specific challenges of QoE management for cloud services are discussed in detail
in [38]. Cloud technologies are used for the provision of a whole spectrum of new and
also traditional services. As users’ experiences are typically application- and
service-dependent, the generality of the services is a major challenge in
QoE monitoring of cloud services. Nevertheless, generic methods are needed, as
tailoring models to each and every application is not feasible in practice. Another
challenge is the multitude of service access methods. Nowadays, people use
a variety of devices and applications to access services from within many
kinds of contexts (e.g., different social situations, physical locations, etc.).

Traditional services that have been moved to clouds can continue using the proven
existing QoE metrics. However, new QoS metrics related to the new kinds of resources
and their management (e.g., virtualization techniques, distributed processing and
storage), and how they contribute to QoE, require new understanding. On the other
hand, the new kinds of services enabled by cloud technologies (e.g., storage and
collaboration) call for research regarding not only QoS-to-QoE mapping, but also the
fundamentals of how users perceive these services. In addition, the much-discussed
aspects of security, privacy, and cost need to be considered within the QoE topic.


4.4 QoE and Context-Aware Monitoring 


Today’s consumer Internet traffic is transmitted on a best-effort basis without taking
into account any quality requirements. QoE management aims at satisfying the
demands of applications and users in the network by efficiently utilizing existing
resources. Therefore, QoE management requires an information exchange between the
application and the network, as well as proper monitoring approaches. There are three basic
research steps in QoE management: (1) QoE modeling; (2) QoE monitoring; and
(3) QoE optimization.

As a result of the QoE modeling process, QoE-relevant parameters are identified,
which have to be monitored accordingly. In general, monitoring includes the collection
of information such as: (1) the network environment (e.g., fixed or wireless); (2) the
network conditions (e.g., available bandwidth, packet loss, etc.); (3) terminal
capabilities (e.g., CPU, memory, display resolution); and (4) service- and
application-specific information (e.g., video bit rate, encoding, content genre) [26, 69].
Monitoring at the application layer may also be important. For instance, QoE
monitoring for YouTube requires monitoring or estimating the video buffer status in
order to recognize or predict when stalling occurs.
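
How the buffer status relates to stalling can be sketched with a simple fluid model. The
model below is an illustrative assumption and not the estimation method of the cited works:
in each time step the client downloads `throughput / bitrate` seconds of video and plays
one step of video if enough is buffered; otherwise the step counts as a stall.

```python
def simulate_buffer(throughput_kbps, bitrate_kbps, initial_buffer_s=1.0, step_s=1.0):
    """Track the playout buffer (in seconds of video) and count stalled steps."""
    buffer_s = initial_buffer_s
    levels, stalls = [], 0
    for tput in throughput_kbps:
        # Seconds of video downloaded during this step.
        buffer_s += step_s * tput / bitrate_kbps
        if buffer_s >= step_s:
            buffer_s -= step_s  # enough video buffered: play one full step
        else:
            stalls += 1         # buffer too low: playback stalls in this step
        levels.append(buffer_s)
    return levels, stalls


# Hypothetical throughput trace (kbit/s) for a 1000 kbit/s video.
levels, stalls = simulate_buffer([1000, 500, 500, 0, 0, 2000], bitrate_kbps=1000)
```

A monitor that observes the buffer level falling towards zero, as in the middle of this
trace, can predict stalling before it occurs and trigger a control action.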

QoE monitoring can be performed: (1) at the end user or terminal level;
(2) within the network; or (3) by a combination thereof. While monitoring within
the network can be done by the provider itself for a fast reaction to QoE degradation, it
requires mapping functions between network QoS and QoE. When application-specific
parameters are taken into account, additional infrastructure like deep packet inspection
(DPI) may be required to derive and estimate these parameters within the network.
A better view of user-perceived quality is achieved by monitoring at the end user level.
However, additional challenges arise, e.g., how to feed QoE information back to the
provider for adapting and controlling QoE. In addition, trust and integrity issues are
critical, as users may cheat to get better performance [68].
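
A mapping function between network QoS and QoE, as required for monitoring within the
network, could for instance take an exponential form. The sketch below is only an
assumption for illustration: the functional shape and the parameters `alpha`, `beta`, and
`gamma` are hypothetical and would have to be fitted to subjective test data for a
concrete service.

```python
import math


def qoe_from_loss(loss_pct, alpha=3.5, beta=0.5, gamma=1.0):
    """Map packet loss (in percent) to a MOS-like score between gamma and alpha + gamma.

    Assumed model: QoE = alpha * exp(-beta * loss) + gamma.
    """
    return alpha * math.exp(-beta * loss_pct) + gamma


def degraded(loss_pct, threshold=3.0):
    """Flag a flow whose estimated QoE falls below a (hypothetical) threshold."""
    return qoe_from_loss(loss_pct) < threshold


scores = [qoe_from_loss(loss) for loss in (0.0, 1.0, 2.0, 5.0)]
```

With such a function, a provider can raise an alarm from QoS measurements alone, at the
price of the modeling error inherent in any generic mapping.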

Going beyond QoE management, additional information may be exploited to opti- 
mize the services on a system level, e.g. allocation and utilization of system resources, 
resilience of services, but also the user perceived quality. While QoE management 
mainly targets the optimization of current service delivery and currently running appli- 
cations, the exploitation of context information by network operators may lead to a more 
sophisticated traffic management, a reduction of the traffic load on inter-domain links, 
and a reduction of the operating costs for the Internet service providers (ISPs). 

Context monitoring aims at obtaining information about the current system situation
from a holistic point of view. Such information is helpful for control decisions. For
example, the popularity of video requests may be monitored, and events (like soccer
matches) may be foreseen, which allows services to be controlled and resources to be
allocated more effectively. This
information may stem from different sources like social networks (useful for figuring
out the popularity of videos and deciding about caching/bandwidth demands) but can also
be monitored on the fly. Thus, context monitoring includes aspects beyond QoE
monitoring (Fig. 2) [39]. Context monitoring can increase QoS and QoE (due to
management of individual flows/users), and it may also improve the resilience of services
(due to broad information about the network “status”) [64].


Fig. 2. Relation between QoE monitoring and context monitoring. [Figure not reproduced;
labels in the figure: context monitoring; QoE monitoring; e.g. device capabilities;
e.g. video buffer status; e.g. available resources; e.g. predicted traffic demands;
e.g. user expectations.]

Context monitoring requires models, metrics, and approaches that capture the
conditions/state of the system (network infrastructure, up to and including the service
layer), but also application/service demands and the capabilities of the end-user device.
The challenges here are the following: (1) identification of the context information
relevant for QoE as well as for reliable services; (2) quantification of QoE, based on
relevant QoS and context information; and (3) the monitoring architecture and concept.


4.5 Inclusion of Human Factors 


Inevitably, Internet applications and services assist us on a growing scale in our daily life
situations, fulfilling our needs for leisure, entertainment, communication, or information.
On the one hand, user acceptance of an existing Internet service/application
depends on a variety of human factors influencing its perception; on the other
hand, there are many human factors and needs, as yet unknown, that could be supported by
Internet services and computing at large. Despite the importance
of understanding human factors in computing, neither a sound methodology for the
evaluation of these factors and the delineation of new ones, nor reliable methods to design
new Internet services with these factors in mind, exist.

This challenge goes beyond the QoE/QoS challenge presented in the previous
subsection, which relates to the user experience with respect to an existing and used system.
The challenge presented here relates to the identification of the unmet (implicit) needs of
the user, enabling the future provision of novel and useful services. These human factors
may relate to specific phenomena ranging from, for example, the most preferred
interaction style with a service (e.g., auditory, kinesthetic, visual) in a given context,
via the user’s specific health and care needs (e.g., wellness or anti-ageing), to
user-specific factors like cognitive load, physical flexibility, or momentary perception of
safety or intimacy in a specific context [33, 42, 54].

In this challenge, one aims firstly to provide a set of rigorous interdisciplinary, i.e.,
mixed-methods based, methodological steps for quantifying human
factors in computing within the user’s natural environments and different contexts of
service usage [78]. The methodology incorporates qualitative and quantitative methods
and involves real users in their real-life environments through:


• Gathering the cumulative users’ opinion via open-ended interviews and surveys.
These focus specifically on understanding the users’ expectations towards a
researched phenomenon and their current experience of this phenomenon, mostly to
establish the users’ baseline experience on the experiment variables and context, but
also to gather general demographics about the experiment participants.

• Gathering the momentary users’ opinion upon some specific factors like health
behaviors, moods, feelings, social interactions, or environmental and contextual
conditions via the Experience Sampling Method (ESM): special momentary
surveys executed multiple times per day ‘in situ’, i.e., in the users’ natural
environments [79].

• Gathering the episodic users’ opinion upon some specific factors (as above) through
semi-structured interviews based on a diary, for example by the Day
Reconstruction Method.

• Gathering data on the users’ daily life contexts and smartphone usage via
continuous, automatic, unobtrusive data collection on the users’ device through the
measurements-based ‘Logger’ service.


Secondly, in this challenge, one wishes to provide guidelines on analyzing the
relation of these factors with the design features of the computing system itself.
Thirdly, one would like to provide guidelines for Internet services/applications
leveraging the human factors in their design process, assuring the user’s experience (QoE)
and thus maximizing user acceptance of these services.


4.6 Aggregated and Encrypted Data (‘Grey Traffic’) 


Monitoring generally assumes that it is possible to extract from the data a set of
parameters (e.g., fields within a packet) that reveal what data is being transported
or what service is being provided. However, there is a recent tendency to paint all traffic
“grey”, i.e., to encrypt even trivial things like public information. Even though this may
appear to protect user privacy, such obfuscation in fact complicates or prevents monitoring,
caching, and prioritization, which could have been used to reduce costs and optimize
user experience. Moreover, it is not only the content that is being encrypted but also the
protocol itself (i.e., only the UDP header or similar is left open). This means that,
contrary to present TCP, one cannot even monitor a flow in terms of data and
acknowledgments to, e.g., detect malfunctioning flows (e.g., subject to extreme losses)
or perform local retransmission (e.g., from a proxy). Regarding content identification,
the solution need not necessarily be unprotected content (there are reasons related to
content ownership, etc.); one can instead imagine tags of different kinds. The
challenge is then to find incentives that encourage correct labelling [12], such that traffic
can be monitored and identified to the extent necessary to optimize networks (long term)
and QoE (short term).


4.7 Timing Accuracy for Network and Service Control 


A key objective of ACROSS was to ensure that the ICT infrastructure that supports the
future Internet is designed such that the quality and reliability of the services running in
such a complex environment can be guaranteed. This is a huge challenge with many
facets, particularly as the Internet evolves in scale and complexity. One key building
block that is required at the heart of this evolving infrastructure is precise and verifiable
timing. Requirements such as ‘real-time control’, ‘quality monitoring’, ‘QoS and QoE
monitoring’, and ‘SDN’ cannot easily or effectively be met without a common sense of
precise time distributed across the full infrastructure. ‘You cannot control what you do
not understand’ is a phrase that applies here, and you cannot understand a dynamic
and real-time system without having precise and verifiable timing data on its
performance. Such timing services will, firstly, ensure that application and
network performance can be precisely monitored and, secondly and more importantly,
facilitate the design of better systems and infrastructures to meet future needs.
Unfortunately, current ICT systems do not readily support this paradigm [83].
Applications, computers, and communications systems have been developed with modules
and layers that optimize data processing but degrade accurate timing. State-of-the-art
systems now use timing, but only as a performance metric. To enable the predicted
massive growth, cross-disciplinary research is needed to integrate accurate timing into
existing and future systems. In addition, the accuracy and security of timing
services represent another critical need. In many cases, having assurance that the
time is correct is a more difficult problem than accuracy. A number of recent initiatives
are focusing on these challenges, cf. [19, 43, 71, 73, 81].
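
Why accurate and verifiable timing is hard can be illustrated with the classic
two-way time transfer calculation used by NTP-style protocols. The sketch below uses the
standard offset and delay formulas over four timestamps; the timestamp values (in
milliseconds) are made up for illustration. It shows that a path asymmetry, which the
client cannot observe, translates directly into an offset error.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and round-trip delay from four timestamps.

    t1: request sent (client clock)    t2: request received (server clock)
    t3: reply sent (server clock)      t4: reply received (client clock)
    The offset estimate is exact only if both path directions have equal delay.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Server clock is 5 ms ahead; symmetric path, 10 ms in each direction.
symmetric = offset_and_delay(t1=100, t2=115, t3=116, t4=121)   # -> (5.0, 20)

# Same clocks, but 15 ms forward and 5 ms return delay (asymmetric path).
asymmetric = offset_and_delay(t1=100, t2=120, t3=121, t4=121)  # -> (10.0, 20)
```

In the asymmetric case the measured round-trip delay is unchanged, yet the estimated
offset is wrong by half the asymmetry (5 ms), which is one reason why assurance of
correct time is harder to obtain than nominal accuracy.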


4.8 Prediction of Performance, Quality and Reliability for Composite 
Services 


Service orientation, as a paradigm that shifts the focus from software production to
software use, is gaining popularity; the number of services is growing, with higher potential
for service reuse and integration into composite services. There may be a number of
possibilities to create a specific composite service, and they may differ in structure and in
the selection of services that form the composition. Composite services are also
characterized by functional and non-functional attributes. Here the focus is on
prediction models for the behavior of composite services with respect to performance,
quality, and reliability. There are many approaches to building Quality of Service (QoS) aware
service compositions [47], and most of them are based on heuristic and meta-heuristic
approaches. However, there is a lack of mathematical models that provide a better
understanding of the underlying causes that generate particular QoS behavior.

Regarding QoS and reliability prediction of composite services, it is well known
that the distributions of size, faults, and failures over software components in large-scale
complex software systems follow power laws [30, 36]. Knowledge of the
underlying generative models for these distributions enables developers to identify
critical parts of such systems at early stages of development and act accordingly to
produce higher-quality and more reliable software at lower cost. Similar behavior is
expected from large-scale service compositions. The challenge is to extend the theory
of the distribution of size, faults, and failures to other attributes of services (e.g., the
above-mentioned non-functional attributes) in large-scale service compositions.
Identification of such distributions, which may contain generative properties, would make it
possible to predict the behavior of composite services.
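
Identifying a power-law distribution in measured attributes typically starts with
estimating its exponent. The following sketch uses the standard maximum-likelihood
estimator for a continuous power law with a known lower bound `x_min`; the synthetic data,
the chosen exponent, and the helper names are assumptions for illustration and do not
reproduce the analyses in [30, 36].

```python
import math
import random


def sample_power_law(alpha, x_min, n, seed=0):
    """Draw n samples from p(x) ~ x^(-alpha), x >= x_min, by inverse transform."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_power_law_exponent(samples, x_min):
    """Maximum-likelihood estimate: alpha = 1 + n / sum(ln(x_i / x_min))."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


# Synthetic "faults per component"-like data with a known exponent of 2.5.
data = sample_power_law(alpha=2.5, x_min=1.0, n=20000)
alpha_hat = fit_power_law_exponent(data, x_min=1.0)
```

On real measurements one would additionally have to select `x_min` and test the goodness
of fit, since many heavy-tailed data sets only resemble a power law over a limited range.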


4.9 Monitoring with SDN 


SDN is a new and promising networking paradigm [53, 62]. It consists in decoupling the
control plane from the forwarding plane and offers a whole set of opportunities to monitor
network performance. In SDN, each node (router, switch, ...) can report to a controller
almost any information regarding the traffic traveling through the network at any time.
The controller can define a set of patterns for the node to apply, and the node counts the
number of packets matching each pattern. Basic monitoring applies to the
well-known header fields at any communication layer. However, virtualized network
functions (NFV) can be introduced at some nodes to perform fine-grained monitoring of the
data (e.g., DPI to extract specific information from the payload) and therefore enable the
controller to have full knowledge of what happens in the network.
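
The pattern-and-counter mechanism described above can be sketched in a few lines. This is
a deliberately simplified model, not the API of any SDN controller or switch: packets are
plain dictionaries of header fields, a pattern matches if all of its fields are equal, and
every matching pattern is counted (real flow tables additionally use priorities and
wildcards). All field and rule names are hypothetical.

```python
def matches(pattern, packet):
    """A packet matches if every field of the pattern has the same value."""
    return all(packet.get(field) == value for field, value in pattern.items())


def count_matches(rules, packets):
    """Return, per rule name, the number of packets matching its pattern."""
    counters = {name: 0 for name in rules}
    for packet in packets:
        for name, pattern in rules.items():
            if matches(pattern, packet):
                counters[name] += 1
    return counters


# Patterns as a controller might install them on a node (hypothetical).
rules = {
    "http": {"ip_proto": 6, "dst_port": 80},
    "dns": {"ip_proto": 17, "dst_port": 53},
}
packets = [
    {"src_ip": "10.0.0.1", "ip_proto": 6, "dst_port": 80},
    {"src_ip": "10.0.0.2", "ip_proto": 6, "dst_port": 443},
    {"src_ip": "10.0.0.1", "ip_proto": 17, "dst_port": 53},
    {"src_ip": "10.0.0.3", "ip_proto": 6, "dst_port": 80},
]
counters = count_matches(rules, packets)
```

The controller would then poll such counters periodically; how often and from how many
nodes is exactly the scalability question raised in the remainder of this subsection.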


Of course, these great opportunities provided by SDN are accompanied by a list of
(measurement and monitoring) challenges currently being researched worldwide [14, 35, 46].
For instance, how many controllers should be deployed? Too many controllers would
bring us back to the older architecture; on the other hand, too few controllers, each
centralizing a large area, would induce delay in getting the information and would require
very expensive computation power to deal with the huge amount of data. In the latter case,
this would also create a bottleneck in the vicinity of the controller(s).


5 Smart Pricing and Competition in Multi-domain Systems 


WG3 dealt with pricing and competition in the IoS, in particular in relation to service 
quality and reliability. 


5.1 Introduction and Background 


Service providers in the IoS could implement their own pricing mechanisms, which may
range from simple static pricing to advanced dynamic policies in which prices may, e.g.,
vary (even at small time scales) according to the actual demand [4]. The involvement of
third-party and cloud services in making up a composite service in these dynamic and
competitive environments (with all involved parties striving to maximize their
own profit) raises challenging questions that are new, even though one can learn from
the past. For example, in the traditional Internet, volume-based charging schemes tend
to be replaced by flat-fee charging schemes. In this context, typical questions are:
(1) what are the implications of implementing different pricing mechanisms in a
multi-domain setting? (2) how do quality levels and pricing mechanisms relate?
(3) how can one develop smart pricing mechanisms that provide proper incentives for
the involved parties (regarding brokering, SLA negotiation strategies, federation, etc.)
and lead to a stable ecosystem? (4) what governing rules are needed to achieve this?


5.2 Modeling QoS/QoE-Aware Pricing Issues 


A key challenge is to understand what the correct digital “goods” are (e.g., in the cloud,
in a distributed setting, beyond just physical resources), and at what level of granularity
to consider pricing and competition issues [2, 7]. An overview of some of the pricing
issues for the cloud is given in [51]. Initial cloud services were primarily resource
based, with different types of resources (such as compute power, storage, bandwidth),
different types of service (service and batch), and different service types (IaaS, SaaS,
etc.). Simple fixed pricing schemes are typically used by the providers, with the large
cloud providers forming an oligopoly and competing on price. But even in this setting,
each of the individual component resources and services has its own QoS measures
and SLAs, which makes it difficult to specify the QoS and QoE of an actual service used
by a customer. The landscape is also changing: different types of cloud service
providers are emerging, as are different types of services (such as data services, data
analytics, automated machine learning), which brings additional complexity. Hence,
research is needed on the following subtopics:


• The digital goods and services for an IoS. The challenge is to identify the
fundamental building blocks: for example, are they just physical or virtual resources, as
with current IaaS/PaaS, or do they include abstract computing elements and
capabilities? Can they include software abstractions that would enable flexible SaaS
descriptions rather than the current, limited, application-specific SaaS offerings? How
can data be included as a good? Can higher-layer services, such as automated
analytics or machine learning, be specified as capabilities? A fundamental question
for pricing is whether goods and services are multidimensional or can be thought of
as primarily unidimensional (see [51]).

• A QoS and QoE framework for describing services. The current state of the art in
IaaS is for providers to specify individual resources or bundles, each with some
QoS measure or SLA that often is just based on availability or mean throughput.
The customer has to decide what to purchase, assemble the different resources,
and somehow translate the solution into a QoS or QoE for their usage
scenario. At the other extreme, specific solutions are offered by SaaS for limited
applications (e.g., SAP). As the services and solutions that customers need or want to
offer to their own customers become ever richer, a framework is needed that allows
realistic services to be described in terms of their own QoS and QoE.

• Component specification that allows services to be built up from components. The
challenges here are closely tied to those for QoS and QoE. The current bottom-up
purchase and construction of services from individual components makes life easy
for providers but difficult for customers and solution providers, who would typically
want a top-down specification. For example, an end-customer may see their data as
the primary resource, building services and analytics based on it, and hence want
performance and QoS measures related to these. There is a need to be able to build
services from different components and different providers; the challenge is how to
achieve this.

• Brokering, transfer charging, and “exchanges” to allow for third parties and for
multi-provider services. Pricing models in use now are basic: they typically involve
pay-as-you-go pricing, with discounts for bundling, and with a rudimentary
reservation offering. Amazon offers a Spot market for IaaS, although the pricing
does not appear to reflect a true auction mechanism [5]. There is a need for more
flexible pricing models to enable users with flexible workloads to balance price
against performance, and to reflect elastic demand. Research is needed to see how
transfer charging may encourage multi-provider services, and whether compute and
data resources can be treated as digital commodities and traded in exchanges.


5.3 Context-Dependent Pricing, Charging and Billing of Composite 
Services 


Pricing, charging and billing of composite services, provided to the end user by dif- 
ferent service providers, require the construction and elaboration of new mechanisms 
and techniques in order to provide the best service to the user [3, 48, 77], depending on 
their current context¹, and to enable viable business models for the service providers [8,
28, 65]. Solving this problem requires advances in mechanism design: current eco- 
nomic theory lacks the sophistication to handle the potentially rich variety of service 
descriptions and specifications that could be delivered in the IoT and Next Generation 
Internet (NGI). 

Charging and billing (C&B) requires mechanisms to allow secure transactions, 
trusted third party (TTP) C&B [45], cross-provider payments, micropayments and to 
allow for new payment paradigms, such as peer-to-peer currencies. The TTP feature of 
the C&B entity, perhaps, will also facilitate the initial establishment of trust and 
subsequent interaction, e.g. to ensure interoperability, between different service pro- 
viders as regards the services (service components) provided by each of them. 

The pricing and C&B need to be aligned with service definition and implemen- 
tation. Hence the autonomous control aspects (ACROSS WG1) need to be inextricably 
linked to pricing, and what can be measured (ACROSS WG2). This challenge relates 
also to the services’ intelligent demand shaping (IDS) and services’ measurement, 
analytics, and profiling (MAP). 

As a specific example, service delivery and SLAs are linked to the dynamic 
monitoring of the quality of each component of the composite service, with an ability to 
dynamically replace/substitute the component(s) that is/are currently underperforming 
with another one(s), which is/are identified as working better in the current context. The 
replacement of service components must be performed transparently to the user — 
perhaps with the user only noticing improvements in the overall service quality. 
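
The substitution step described above can be illustrated with a minimal sketch. It assumes that each component exposes a single, already monitored quality index in [0, 1] and a role within the composite service. The `Component` type, the threshold, and the component names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    role: str        # function fulfilled within the composite service
    quality: float   # latest monitored quality index in [0, 1] (hypothetical metric)


def substitute_underperformers(active, candidates, threshold):
    """Replace each active component whose quality is below `threshold`
    by the best candidate with the same role, if that candidate is better."""
    result = []
    for comp in active:
        if comp.quality >= threshold:
            result.append(comp)
            continue
        same_role = [c for c in candidates if c.role == comp.role]
        best = max(same_role, key=lambda c: c.quality, default=None)
        if best is not None and best.quality > comp.quality:
            result.append(best)   # swap in the better component
        else:
            result.append(comp)   # nothing better available, keep the current one
    return result


active = [
    Component("storage-A", "storage", 0.9),
    Component("transcoder-A", "transcoding", 0.4),
]
candidates = [
    Component("transcoder-B", "transcoding", 0.8),
    Component("storage-B", "storage", 0.7),
]
updated = substitute_underperformers(active, candidates, threshold=0.6)
print([c.name for c in updated])
```

A real system would additionally have to migrate state and keep the swap invisible to the user, which this sketch does not attempt.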


5.4 QoS and Price-Aware Selection of Cloud Service Providers 


The rise of IaaS clouds has led to an interesting dilemma for software engineers.
Fundamentally, the basic service offered by different providers (e.g., Amazon EC2, or, 
more recently, Google Compute Engine) is entirely interchangeable. However, 
non-functional aspects (e.g., pricing models, expected performance of acquired 
resources, stability and predictability of performance) vary considerably, not only 
between providers, but even among different data centers of the same provider. This is 
made worse by the fact that, currently, IaaS providers are notoriously vague when 
specifying details of their service (e.g., “has two virtual CPUs and medium networking 
performance”). As a consequence, cloud users are currently not able to make an 
informed decision about which cloud to adopt, and which concrete configuration (e.g., 
instance type) to use for which application. Hence, cloud users often base their most 
fundamental operational decisions on hearsay, marketing slogans, and anecdotal evidence
rather than sound data. Multiple research teams worldwide have proposed tools
to allow developers to benchmark cloud services in a more rigorous way prior to
deployment (e.g., CloudCrawler, CloudBench, or Cloud Workbench [27, 70]).
However, so far, fundamental insights are missing as regards which kind of IaaS
provider and configuration is suitable for which kind of application and workload.


¹ The context, from which the price is computed, has three main components: user context (i.e. the
user’s location, preferences and profile(s), the user’s mobile device(s), etc.), network context (i.e. the
congestion level, the current data usage pattern, the current QoS/QoE index, the cost of using a
network, etc.), and service context (i.e. the category, type, scope, and attributes of the service being
requested, the request time, the application initiating the request, the current QoS/QoE index of the
service component, etc.).
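
As a rough illustration of QoS- and price-aware selection, the following sketch ranks IaaS offers by price per unit of benchmarked performance, after discarding offers whose performance is too unpredictable. All provider names, prices, and benchmark figures are invented placeholders, and the ranking criterion is only one of many conceivable choices.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    provider: str
    instance_type: str
    price_per_hour: float    # hypothetical price
    benchmark_score: float   # throughput measured for the user's own workload
    score_stddev: float      # variability across repeated benchmark runs


def rank_offers(offers, max_relative_stddev=0.2):
    """Drop offers with unpredictable performance, then sort the rest
    by price per unit of benchmarked performance (cheapest first)."""
    predictable = [
        o for o in offers
        if o.score_stddev / o.benchmark_score <= max_relative_stddev
    ]
    return sorted(predictable, key=lambda o: o.price_per_hour / o.benchmark_score)


offers = [
    Offer("provider-a", "medium", 0.10, 100.0, 5.0),
    Offer("provider-b", "small", 0.08, 60.0, 6.0),
    Offer("provider-c", "burst", 0.05, 80.0, 30.0),   # cheap but highly variable
]
ranked = rank_offers(offers)
print([(o.provider, o.instance_type) for o in ranked])
```

The benchmark scores are assumed to come from measurements taken before deployment, for example with one of the benchmarking tools mentioned above.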


6 Conclusion 


As can be seen from this chapter, there is a wide variety of research challenges in the
area of autonomous control for a reliable Internet of Services (IoS), which of course
cannot be covered by a single book. The following chapters deal with a subset of these,
mainly related to service monitoring, control, management, and prediction, leaving the
remaining challenges for another book.
Abstract. Whereas some application domains show a certain consensus
on the role of system factors, human factors, and context factors, QoE 
management of multimedia systems and services is still faced with the 
challenge of identifying the key QoE influence factors. In this chapter, 
we focus on the potential of enhancing QoE management mechanisms 
by exploiting valuable context information. 

To get a good grip on the basics we first discuss a general frame- 
work for context monitoring and define context information, including 
technical, usage, social, economic, temporal, and physical factors. We 
then iterate the opportunities and challenges in involving context in QoE 
monitoring solutions, as context may be, e.g., hard to ascertain or very 
situational. 


© The Author(s) 2018 
The benefits of including context in QoE monitoring and management
are demonstrated through use cases involving video flash crowds as well 
as online and cloud gaming. 

Finally, we discuss potential technical realizations of context-aware 
QoE monitoring and management derived based on the SDN paradigm. 


1 Introduction 


With the ever increasing availability of media-rich, personalized, and context- 
aware services delivered over today’s networks, service and network providers are 
constantly battling to maintain a satisfied customer base. In addition, a paradigm 
shift is being witnessed in Internet service delivery, whereby we see a transition 
towards what has been referred to as an Internet of Services (IoS), envisioning 
everything on the Internet as a service [44]. Such a transition will potentially 
lead to new services being realized as large-scale service chains, combining and 
integrating the functionality of (possibly many) other services offered by third 
parties (e.g., infrastructure providers, software providers, platform providers). 
Key aspects and challenges to address will include the reliability and Quality of 
Service (QoS) delivery, which inherently relies on monitoring, quality estimation, 
and prediction mechanisms. 

In light of the very competitive market, end-user Quality of Experience (QoE) 
will be one key differentiator between providers. In order to successfully manage 
QoE, it is necessary to identify and understand the many factors that can—and 
the specific factors that actually do in a given scenario—affect user QoE. Result- 
ing QoE models dictate the parameters to be monitored and measured, with the 
ultimate goal being effective QoE optimization strategies [23]. The majority of 
QoE-based management approaches to date are primarily based on either net- 
work management (facilitated through monitoring and exerting control on access 
and core network level) or application management (e.g., adaptation of quality 
and performance on end-user and application host/cloud level) [52]. For example, 
many Web services (e.g., YouTube or Netflix), which are transparently run over 
various Internet Service Provider (ISP) networks, commonly implement QoE 
control schemes on the application layer by adapting the application to the con- 
ditions found in the network. The network and service management approaches 
of today’s networks are often designed to operate solely in the domain of a single 
stakeholder. Consequently, due to a lack of information exchange and coopera- 
tion among involved parties the effectiveness of such approaches is limited [6]. 

Going beyond QoE management, additional information may be exploited 
to optimize the services on a system level, e.g., by considering resource allo- 
cation and utilization of system resources, resilience of services, but also the 
user perceived quality. While QoE management chiefly targets the optimization 
of service delivery and currently running applications, the exploitation of con- 
text information by network operators could lead to more sophisticated traffic 
management, a reduction of the traffic load on the inter-domain links, and a 
reduction of the operating costs for the ISPs. Context monitoring in its broadest 
sense aims to obtain as much information about the current system state as
possible. For example, the popularity of video requests could be monitored, or 
the consequences of events (such as soccer matches) could be foreseen based on 
data collected from different sources, such as social networks (e.g., the popularity 
of videos could dictate caching/bandwidth demands). Such a holistic viewpoint 
offers the necessary input for providing enhanced service control and resource 
allocation functions, with clear potential for improving QoE. To date, there is still 
a limited understanding of the potential business models, technical realizations, 
and exploitation benefits of extending “traditional” QoE monitoring solutions 
(e.g., monitoring of network conditions, device capabilities, application specific 
parameters, and user-related factors) with context monitoring. 

The goal of this chapter is to provide further insight into the potential of 
exploiting context monitoring for QoE management, both in terms of increas- 
ing QoS and QoE (by managing individual flows and users), and also in terms 
of improving the resilience of services (by making available information about 
network state to these services). We also acknowledge the fact that today’s Inter- 
net traffic mix is extremely diverse, with traffic from numerous services fused 
together. Context monitoring is discussed from the perspective of a single service 
(or class of services), as those fundamentals are important to us. The interactions
of concurrent QoE management efforts in a single network are extremely
interesting as well, but are outside the scope of this chapter.

This chapter is organized as follows. Section 2 lays out a generic framework for 
context monitoring by proposing a classification scheme for context information 
and discussing the potential involvement of this information in QoE monitoring 
and management solutions. For demonstration purposes, example scenarios are 
depicted in Sect. 3 illustrating the benefits of exploiting context information in
actual use cases that can ultimately lead to overall QoE improvements. Section 4 
then discusses a technological solution path for enhancing QoE with Software 
Defined Networking (SDN) that utilizes context monitoring data. Finally, con- 
cluding remarks and future work are given in Sect. 5. 


2 Generic Framework for Context Monitoring 


A generic framework for context monitoring provides the means to utilize context 
information in order to improve a networked system. To this end, the term 
context information is defined in Sect. 2.1. A classification scheme of context
information is proposed which provides the means for systematically identifying 
useful context information for various use cases. Section 2.2 describes how to 
model QoS and QoE as input for any improvement strategy. Different approaches 
for involving context in QoE monitoring are introduced in Sect. 2.3. 


2.1 Definition and Classification of Context Information 


The term context is very broad and several definitions exist in literature. Those 
definitions vary depending on the actual system under consideration. Goals that 
involve context could, for example, be end user QoE improvements, QoS improvements,
or cost reductions. Consequently, on one hand, context is defined to be one
of three QoE influence factors in [50]. In a similar manner, the authors of [48] define 
context as “anything that can be used to specify or clarify the meaning of an event”. 
This kind of context information may be deployed directly during measurements, 
modeling, or an improvement process of QoE, as done for mobile TV [8]. 

On the other hand, context also relates to information that makes it possible to
determine the current state of a system. The authors of [1] define context as “any
information that assists in determining a situation(s) related to a user, network 
or device”. Such information may be indirectly utilized in order to improve QoE. 
Additionally, it allows for direct measurements and improvements of the QoS. 
Since there is a strong relationship between QoS and QoE, e.g., as [18] notes, 
transitioning between different types of context factors is straightforward and we 
do not need to distinguish between them. The utilization of context information 
is of course strongly dependent on the actual use case. 

A context space model is proposed by [42] where situations are determined 
from context attribute values, e.g., sensor data. First, these context attribute val- 
ues are used to infer context states defined as “the current state of a user, appli- 
cation, network or device being modeled at a time instant based on the context 
attributes”. The context states are combined to determine an overall situation. 
In our view, the proposed context space model is not required to integrate con- 
text into QoE monitoring and management. The context information directly 
influences QoE, therefore the intermediate step to map the context attribute 
value to a context space is not always necessary. 

This notion of context factors as QoE influence factors is also in line 
with recent works described in [50,56]. The work in [56] considers four multi- 
dimensional spaces: Application, Resource, Context, and User space, together 
dubbed “ARCU model”. Context is thereby composed of dimensions that indi- 
cate the “situation in which a service or application is being used”. In a simi- 
lar manner, [33,50] describe context influence factors as “factors that embrace 
any situational property to describe the user’s environment in terms of physical, 
temporal, social, economic, task, and technical characteristics. These factors can 
occur on different levels of magnitude, dynamism, and patterns of occurrence, 
either separately or as typical combinations of all three levels”.

Based on the classes of context factors that influence QoE, as defined in 
[50], we generalize the classification by additionally integrating context factors 
from a system’s point of view as well. Thus, context information may be used 
to determine the user or the system situation and may be directly or indirectly
mapped to QoE and QoS. Figure 1 visualizes the taxonomy of context
factors and includes several examples as instantiations of those classes. Here, 
we divide context factors into five broad categories, with further refined—albeit 
non-exhaustive—sub-classes of context. Section 3 will depict different use cases 
that consider context factors from multiple classes. Further examples can also 
be found in [33, 48, 50, 56].
Fig. 1. Classification of context factors involved in QoS and QoE monitoring. Examples
are denoted in light grey. 


2.2 Towards Context-Aware QoE Modeling and Estimation 
Strategies 


Following the overview of different categories of context factors influencing QoE, 
we discuss QoE modeling concepts and the potential of enhancing the quality 
estimation process with context awareness. 

The main aim of QoE estimation models is to mimic the quality assessment 
process performed by the end user of a corresponding service as accurately as 
possible. The main output of the model is thus an estimation of the quality 
as perceived by the end user in the specific scenarios of interest. The aim is to 
achieve a high correlation between estimated quality scores and subjective scores 
in these scenarios. Quantifiable influence factors are mapped to quality scores 
commonly expressed by Mean Opinion Score (MOS) values. To this end, the 
data obtained from subjective quality tests is required in order to find mapping 
functions that best resemble the human perception of quality. Different types of 
quality models are currently in use for service quality estimation, quality monitoring,
or even service design. We refer the interested reader to [53], which
provides a comprehensive review of different types of quality models. Reduced 
reference and no-reference models are generally deployed for real-time service 
monitoring, as they do not require access to a complete reference signal in the 
assessment process. In principle, both reduced reference models as well as no- 
reference models can be applied at different points in the network in order to 
better quantify the impact of packet loss and other important network parame- 
ters on QoE. 

As QoE is a multidimensional concept influenced by a number of system, 
user, and context factors, it is important to keep in mind that QoE is highly 
dependent on QoS. According to [23], QoS parameters represent one of the most 
business-relevant parameters for network and service providers. The mathemat- 
ical dependency of QoE on QoS parameters at both the network and application 
level can usually be characterized by logarithmic or exponential functions as they 
best mimic the corresponding user quality perception. Logarithmic relationships
were studied in [47,49] and described in terms of the Weber-Fechner law [60]. 
In principle, this law traces the perceptive abilities of the human sensory sys- 
tem back to the perception of so-called “just noticeable differences” between 
two levels of a certain stimulus. For most human senses, the just noticeable dif- 
ference can be represented by a constant fraction of the original stimulus size. 
Exponential relationships can be described by the IQX hypothesis [18,24], which 
describes QoE as an appropriately parametrized negative exponential function of 
a single impairment factor (QoS parameter). A main assumption behind the IQX 
hypothesis is that a change in QoE depends on the actual level of QoE. There- 
fore, the corresponding relationship can be described by a differential equation 
with an exponential solution. Both types of relationships confirm the general 
observation that end users are rather sensitive to quality impairments as long 
as the actual quality level is good, whereas changes in network conditions have 
less impact on their quality perception when the quality level is low. However, 
they differ in terms of their basic premise. The Weber-Fechner law links the 
magnitude of the QoE change to an actual QoS level, whereas the IQX hypoth- 
esis assumes that the magnitude of change depends on the actual QoE level. 
Furthermore, the law is mostly valid when a QoS parameter of interest relates 
to a signal or application level stimulus directly perceivable by the end user 
(e.g., delay or audio distortion), while the IQX hypothesis was derived for QoS 
impairments on a network level, which are not directly perceivable by the end 
user (e.g. packet loss) [53]. 
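
The difference between the two premises can be made concrete with a small numerical sketch. It assumes the commonly used forms QoE = α·exp(−β·x) + γ for the IQX hypothesis and QoE = a − b·ln(x) for a logarithmic relationship, where x denotes a QoS impairment or stimulus. The parameter values are arbitrary and not fitted to any data.

```python
import math


def qoe_iqx(x, alpha, beta, gamma):
    """IQX-style model: QoE decays exponentially with the impairment x."""
    return alpha * math.exp(-beta * x) + gamma


def qoe_log(x, a, b):
    """Logarithmic (Weber-Fechner-style) model for a stimulus x > 0."""
    return a - b * math.log(x)


def slope(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Arbitrary illustrative parameters (assumptions, not measured values).
ALPHA, BETA, GAMMA = 4.0, 0.5, 1.0
A, B = 4.0, 1.0


def iqx(x):
    return qoe_iqx(x, ALPHA, BETA, GAMMA)


def log_model(x):
    return qoe_log(x, A, B)


for x in (0.5, 2.0, 8.0):
    # IQX premise: the slope depends on the current QoE level, -beta * (QoE - gamma).
    # Logarithmic premise: the slope depends on the current QoS level, -b / x.
    print(f"x={x}: IQX slope={slope(iqx, x):.4f}, "
          f"-beta*(QoE-gamma)={-BETA * (iqx(x) - GAMMA):.4f}, "
          f"log slope={slope(log_model, x):.4f}, -b/x={-B / x:.4f}")
```

In both models the slope is steepest for small x, i.e., when quality is still good, which matches the general observation stated above.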

At a general level, the process of estimating QoE involves relating network- and 
system-level Key Performance Indicators (KPIs) (e.g., delay, loss, throughput, or 
CPU consumption) to end-to-end application level Key Quality Indicators (KQIs) 
(e.g., service availability, media quality, reliability), cf. also Fig. 2. QoE estimation 
models may then be derived as a weighted combination of KQIs as 


$$\mathrm{QoE} = f(w_1 \cdot \mathrm{KQI}_1, \ldots, w_n \cdot \mathrm{KQI}_n).$$


It should be noted that the weights $w_1$ to $w_n$ differ as the KQIs have different
levels of impact on QoE. In other words, each weight represents the strength of the
impact of the particular KQI on QoE.
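
One simple instance of such a weighted combination is a normalized weighted mean of KQI scores. The sketch below assumes that all KQIs are already expressed on a common MOS-like scale. The choice of f, the KQI names, and the weights are illustrative assumptions and are not taken from a specific model.

```python
def estimate_qoe(kqis, weights):
    """Normalized weighted mean of KQI scores (one possible choice of f)."""
    if set(kqis) != set(weights):
        raise ValueError("kqis and weights must cover the same indicators")
    total_weight = sum(weights.values())
    return sum(weights[name] * kqis[name] for name in kqis) / total_weight


# Hypothetical KQI scores on a 1-5 scale and hypothetical weights.
kqis = {"service_availability": 5.0, "media_quality": 3.0, "reliability": 4.0}
weights = {"service_availability": 1.0, "media_quality": 2.0, "reliability": 1.0}

qoe = estimate_qoe(kqis, weights)
print(qoe)  # (1*5 + 2*3 + 1*4) / 4 = 3.75
```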

As an example, consider a video streaming service whereby transmission 
parameters such as loss or delay result in video artifacts impacting the media 
quality, which may in turn be translated to QoE. In certain cases a QoE model 
may directly incorporate KPIs, e.g., QoE estimation modeled in terms of network 
delay or packet loss. Going beyond a “basic” view of the QoE estimation process, 
additional input to a QoE estimation model may be provided by user or context 
influence factors [49]. We refer to an “enhanced” context-aware QoE estimation 
model as potentially incorporating context data on three different levels: 


(a) Context data on system state that directly impacts KPIs and thus also
indirectly QoE, e.g., network load, traffic patterns, offloading support, end
user device processing capabilities.

(b) Context data that directly impacts KQIs and in turn QoE, e.g., public events
or flash crowds in a specific geographic region.

(c) Context data that impacts the choice of QoE model, KQIs, and weight factors.
Based on context data certain KQIs may or may not be included in the
QoE model.


Fig. 2. Context-aware QoE estimation process, adopted with modifications from [4].
User perceivable KQIs are derived from KPIs and show a root cause relationship, while
KQIs and KPIs are further mapped to QoE using QoE estimation models. Additional
context data (typeset in bold) may be integrated into the process on three different
levels and lead to more reliable and accurate QoE estimations.
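
Level (c) can be sketched as a context-dependent choice of weight profile. In this hypothetical example, a usage context flag decides whether a video quality KQI is part of the model at all. The profile names, KQIs, and weights are assumptions made purely for illustration.

```python
# Hypothetical weight profiles; each set of weights sums to 1.
WEIGHT_PROFILES = {
    "foreground": {"video_quality": 0.5, "audio_quality": 0.3, "stalling": 0.2},
    # When the application runs in the background, video quality is not
    # perceived by the user, so the corresponding KQI is left out.
    "background": {"audio_quality": 0.6, "stalling": 0.4},
}


def select_profile(context):
    """Pick the weight profile from the usage context."""
    key = "background" if context.get("app_in_background") else "foreground"
    return WEIGHT_PROFILES[key]


def estimate_qoe(kqis, context):
    """Weighted sum over the KQIs that the selected profile includes."""
    weights = select_profile(context)
    return sum(weight * kqis[name] for name, weight in weights.items())


kqis = {"video_quality": 2.0, "audio_quality": 4.0, "stalling": 5.0}
qoe_foreground = estimate_qoe(kqis, {"app_in_background": False})
qoe_background = estimate_qoe(kqis, {"app_in_background": True})
print(qoe_foreground, qoe_background)
```

With identical KQI values, the two contexts yield different estimates, because the poor video quality only counts when the video is actually being watched.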


In practice, QoE monitoring applications usually run on top of general net- 
work monitoring systems that provide input QoS parameters together with a 
certain amount of context information. As the monitoring system is supposed 
to work in real-time, computational efficiency and data reliability represent the 
most important performance indicators. Moreover, as most of the current QoE 
monitoring systems also require application-specific parameters as their input 
to allow a reliable estimation of QoE, additional monitoring techniques, such as 
Deep Packet Inspection (DPI), are commonly implemented, but they can also 
negatively affect the system’s performance. The availability of context informa- 
tion not only provides the opportunity to use enhanced QoE models as discussed 
previously, but also provides more reliable data and improves QoE estimation 
accuracy. In other words, complex context information provided to a monitoring 
system in well-defined frequent intervals can fine-tune the reliability and accu- 
racy of QoE estimation. As QoE estimation models and context monitoring tools 
are usually part of the same monitoring system, they are aware of each other. 
If this were not the case, system complexity would greatly increase, with
only marginal QoE estimation improvements in terms of reliability and
accuracy to be had. Consequently, given that QoE monitoring represents a
crucial part of QoE management systems, this leads to more efficient and reliable
adaptation strategies that specify how to change parameters at different layers
in order to improve QoE.


Table 1. QoE monitoring insights for each context factor class. The given examples
are non-exhaustive.

Context class | Information source | Techniques and tools | Examples of opportunities
Physical | User, device, network | Sensors, DPI, user feedback, GPS information, prediction techniques | Pre-allocate network resources based on the user’s daily life pattern, location or device capabilities
Temporal | Network | Operator databases, DPI | Reconfigure the routing process based on historical information about the traffic load per hour of day
Economic | ISP, operator | Data repositories (subscriber databases) | Differentiate gold subscribers over best-effort ones
Social | ISP, operator, CDN, OTT, media, social networks, user | Data analytics, prediction techniques | Video flash crowd
Usage | User, device | User feedback, application monitoring, DPI | Prioritize foreground versus background applications
Technical | Device, network | DPI, sensors, actuators, probes, embedded agents, network elements | Control the cell selection based on the current device battery level

We direct the interested reader to the chapter on QoE management 
approaches—dealing specifically with QoE management in Future networks— 
and to [57,63]. 


2.3 Involving Context in QoE Monitoring 


Specifying the QoE modeling and estimation process is an essential prerequi- 
site to QoE monitoring. We have recently been witnessing a paradigm shift 
from “traditional” QoS monitoring to QoE monitoring procedures [35]. Various
approaches can be used for QoE monitoring, namely on end-user or device level,
network level, or a combination of the two, see, e.g., [53]. QoE monitoring may
run on the network layer to capture QoS parameters, requiring the existence of 
mapping functions between QoS and QoE, but also on application layer in order 
to obtain relevant information directly from the application in question. 

Including additional context data, potentially gathered from a broad range 
of sources, can significantly enrich QoE monitoring by providing better infor- 
mation about a system’s current and future properties and state. For example, 
monitoring social networks’ text streams with natural language processing algo- 
rithms and machine learning approaches [28] can provide a valuable source of 
information regarding short-term trends in popular content or potential
“high-profile” future events, which allows for better caching strategies to avoid stalling
in video streaming [55]. 
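
As a deliberately simplified illustration of this idea, the sketch below replaces natural language processing and machine learning with plain keyword counting. It scans a batch of messages for keywords associated with catalogue items and returns the most frequently mentioned items as candidates for pre-caching. The catalogue, keywords, and messages are invented.

```python
from collections import Counter


def trending_items(messages, catalogue, top_k):
    """Count messages that mention each catalogue item and return
    the identifiers of the top_k most mentioned items."""
    counts = Counter()
    for message in messages:
        text = message.lower()
        for item_id, keywords in catalogue.items():
            if any(keyword in text for keyword in keywords):
                counts[item_id] += 1
    return [item_id for item_id, _ in counts.most_common(top_k)]


# Hypothetical catalogue: content identifier -> lower-case keywords.
catalogue = {
    "video-final": ["cup final"],
    "video-cat": ["cat video"],
    "video-news": ["evening news"],
}
messages = [
    "Can't wait for the cup final tonight!",
    "cup final stream anyone?",
    "new cat video is hilarious",
    "Cup Final predictions",
]
to_precache = trending_items(messages, catalogue, top_k=2)
print(to_precache)
```

A production system would need far more robust text analysis and a notion of time decay, but the output has the same shape: a short list of content items whose demand is expected to rise.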

While QoE-based management is already a novel procedure when compared 
to traditional QoS-based management, its performance can still be enhanced 
further with context awareness. Therefore, in order to reap all the benefits that 
stem from the possibility of context-awareness in a network, viable and fea- 
sible context monitoring mechanisms need to be devised. It is expected that 
these mechanisms will differ per use case, to better reflect the requirements and 
idiosyncrasies of each scenario (as will also be explained in Sect. 3). However,
often a trade-off has to be made between the amount of information collected
and the processing time required to use it properly.

In Table 1, we provide some insights about QoE monitoring in a high-level 
way. Specifically, following the context factor classification proposed in Sect. 2.1, 
we present the potential sources of information per class, the techniques or tools 
to extract information from the appropriate sources, and finally some new oppor- 
tunities, that will arise if context monitoring, and subsequently, context- and 
QoE-aware management are realized. Possible sources of context information are: 


(a) The end-users themselves, 

(b) The users’ device characteristics, ranging from the device’s hardware/ 
software up to the application layer, 

(c) The network, i.e., any intermediate network nodes in the core or the access 
network that are capable of providing context-related information, 

(d) The operator’s or ISP’s proprietary infrastructure, 

(e) Third party information, namely feedback from any players who do not con- 
trol the information flow, but only its content and format (e.g., content and 
“Over-The-Top” (OTT) service providers, social channels such as Facebook, 
Twitter, etc., provisioning their services over the operator’s or ISP’s infras- 
tructure). 


Regarding the acquisition of technical context factors, some insights may be 
found in [15]. 

The main concerns and challenges in implementing QoE monitoring procedures
on top of the already available network mechanisms have to do
with the complexity and signaling overhead that would inevitably be imposed on
the network. Context awareness does not come without a cost. Thus, it is impor- 
tant to employ it only when and where it is meaningful. A crucial point is to 
avoid turning the context monitoring procedure into another negative QoE influence
factor, e.g., by introducing more congestion into the network, draining
the user device’s battery faster, or even persistently requesting user feedback.
Therefore, the extent of the influence of each context factor on QoE needs to 
be established. A further challenge in capturing the context and consequently 
in validating QoE monitoring techniques is the fact that these studies have to 
be conducted in the field. The controlled laboratory environments, which are 
conventionally used for QoE monitoring and modeling purposes, do not allow 
for capturing the diversity of the various context factors and the plethora of use 
cases. Some best practices for acquiring context information are provided in [51], 
although the focus there is mainly on user experience studies and not QoE. 

It should be emphasized that context monitoring requires models, metrics, 
and approaches that capture the condition of the system on a network and 
service layer, application and service demands, and the capabilities of the end- 
user device. Assuming, however, that context monitoring can be realized, then 
the exploitation of context information may lead to more sophisticated system 
and traffic management, traffic load reduction on the inter-domain links, and 
reduction of operating costs for ISPs. However, before this can occur, effective 
methods need to be determined in order to incorporate the monitored context 
parameters into an enhanced cross-layer QoE management. A promising option 
might be based on SDN [31] as discussed further in Sect. 4. 


3 Context Factor Examples, Use Cases and Literature 


To demonstrate the potential benefits of context information in QoE monitoring 
and management, this section presents some practical examples, with varying 
involved parties and services. Table 2 also highlights the relevant factors for some 
scenarios. As outlined in Sect. 2.1, context is a very broad umbrella term that 
encompasses a wide range of elements, many of which are interrelated. To get a 
grasp of their influences, a general overview and related research that illustrates 
the idea of context factors is presented here for individual factors first, before 
moving on to more specific use cases. 


3.1 Usage Examples for the Context Factor Categories 


Table 2. Overview of context factors and their relevancy to the usage scenarios
discussed in Sect. 3.

Context factor | Usage scenario | Relevant context information
Physical | Video Flash Crowd | Devices and sensors detecting crowds
Physical | Online and Cloud Gaming | Type of device and input devices
Physical | Context-Aware Billing | Wireless channel state (SINR, ...)
Temporal | Video Flash Crowd | Diurnal patterns, prior knowledge of events
Temporal | Online and Cloud Gaming | Amount of time played in session, total; load at time of day, week
Temporal | Context-Aware Billing | Period: peak or off-peak hour
Spatial | Video Flash Crowd | Prior knowledge of crowd events
Spatial | Online and Cloud Gaming | Stationary, on the move
Spatial | Context-Aware Billing | Ambient free Internet access
Economic | Video Flash Crowd | Announced/advertised events/content and expected popularity
Economic | Online and Cloud Gaming | Data-center location and capacity, game charging model
Economic | Context-Aware Billing | Used up quota of the user
Social | Video Flash Crowd | Popularity of content, social relationship of video users
Social | Online and Cloud Gaming | Playing solo, cooperatively, or competitively
Social | Context-Aware Billing | Social events for device-to-device communication
Usage | Video Flash Crowd | Estimating popularity and crowds through individual usage history
Usage | Online and Cloud Gaming | Player’s skill level and previous experiences with game and genre
Usage | Context-Aware Billing | Charging based on usage patterns
Technical | Video Flash Crowd | Specific service implementation and ability to handle crowds
Technical | Online and Cloud Gaming | Game features, graphic fidelity, specific game mechanics, genre
Technical | Context-Aware Billing | Charging based on device properties


Usage Context Factors. The interactivity required for a certain task to be
conducted in a satisfactory fashion is another usage context factor that can be
essential in certain scenarios. Besides gaming, this especially concerns
conversational applications. In [7], the extent to which interactivity requirements of
Real-time Communications (RTCs) and specifically voice applications impact
QoE is examined. Of particular note is the ITU-T E-Model [29], a
telecommunications planning tool that generates a QoE score. The authors thus
developed a delay impairment function that depends on the nature of the
conversation. For example, a conversation with strong interactivity requirements
is modeled by a curve that returns a low impairment up to the 150 ms level, then
increases rapidly between 150 ms and 300 ms and levels out thereafter as the
conversation has essentially become half duplex anyway. The authors in [38] also
examine this issue, introducing the concept of “Tasks”. These tasks each describe
a specific set of requirements, e.g., in terms of the conversation’s interactivity.
Examining individual users’ behavioral patterns and from that extrapolating the
actions of a larger crowd in order to enhance, e.g., network-wide video streaming
QoE management is also an enticing aspect of this type of context factor.
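
The shape of the delay impairment curve described above for a highly interactive
conversation can be sketched as follows. This is a minimal illustration and not
the function from [7]: the piecewise-linear form, the normalized output range,
and the parameter names (low_ms, high_ms, floor, max_impairment) are assumptions
made here; only the 150 ms and 300 ms knees come from the text.

```python
def delay_impairment(delay_ms, low_ms=150.0, high_ms=300.0,
                     floor=0.05, max_impairment=1.0):
    """Illustrative, normalized delay impairment for an interactive task.

    Assumed shape: small impairment up to low_ms, steep rise between
    low_ms and high_ms, saturation above high_ms.
    """
    if delay_ms < 0:
        raise ValueError("delay must be non-negative")
    if delay_ms <= low_ms:
        # Low, slowly growing impairment below the first knee.
        return floor * delay_ms / low_ms
    if delay_ms >= high_ms:
        # Saturation: the conversation is effectively half duplex.
        return max_impairment
    # Steep linear rise between the two knees.
    fraction = (delay_ms - low_ms) / (high_ms - low_ms)
    return floor + fraction * (max_impairment - floor)
```

A task with weaker interactivity requirements could be represented by passing
larger knee values, which shifts the steep part of the curve to higher delays.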


Noise and Vibration Context Factors. Research by Wilk et al. [62] presents 
a prototype that measures smartphone background noise and screen vibration 
in real-time for users on the move. These factors are then used to decrease 
downloaded video and audio quality in order to reduce bandwidth. This strategy 
is based on the premise that delivering high quality audio and video under such 
conditions represents a waste of bandwidth and perhaps money, illustrating the 
potential inter-relatedness of context factors and economics. 


Charging and Billing Context Factors. With usage-based pricing in mobile 
networks being a reality in many countries, users are incentivised to sparingly use 
their allotted data volume. Pursuing this line of thought, context monitoring could
open up several opportunities for interactions between the operator, the network, 
and the user in this regard. Data-cap-aware video adaptation is examined in 
[10]. The work proposes to choose the best video resolution that still enables the 
client to stay below its data cap. Another approach presented in [21] evaluates 
the case of shifting elastic traffic to off-peak periods. Alternatively, one can 
postpone delay-tolerant traffic until it can be offloaded to free-of-charge Wi-Fi 
networks. With the upcoming 5G mobile networks it could also be feasible to 
offload certain traffic using direct device-to-device communication with someone 
willing to transfer the data for free assuming sufficient resource availability [9]. 
In this case, context information can be utilized to predict when such a suitable 
connection will be available. Finally, a channel-aware pricing model could also be 
an opportunity, such as is discussed in [19]. Herein, the operator sets the prices 
to a value that fits the currently available radio and network resources. 
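
The data-cap-aware adaptation idea mentioned above (choosing the best video
quality that still keeps the client below its data cap) can be sketched as
below. This is not the algorithm of [10]; the function name, the bitrate ladder,
and the fallback to the lowest rung when nothing fits are assumptions made for
illustration.

```python
def pick_bitrate_kbps(ladder_kbps, remaining_quota_mb, remaining_playback_s):
    """Return the highest bitrate whose projected volume fits the quota.

    Hypothetical helper: falls back to the lowest rung if no bitrate fits,
    which is a design choice of this sketch.
    """
    if not ladder_kbps:
        raise ValueError("bitrate ladder must not be empty")
    if remaining_playback_s <= 0:
        # Nothing left to download, so the cap does not constrain the choice.
        return max(ladder_kbps)
    # Convert the quota to kilobits (decimal units: 1 MB = 8000 kbit).
    budget_kbit = remaining_quota_mb * 8000.0
    affordable = [b for b in ladder_kbps
                  if b * remaining_playback_s <= budget_kbit]
    return max(affordable) if affordable else min(ladder_kbps)
```

For instance, with one hour of playback left and 450 MB of quota, only bitrates
up to 1000 kbit/s fit, so a ladder of 500, 1000, 2500, and 5000 kbit/s yields
1000 kbit/s.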


Mobility Context Factors. With the evolving mobile architectures, users 
increasingly expect service and application availability whilst on the move. 
Returning to the aforementioned ITU-T E-model [29] the advantage of appli- 
cation accessibility on the move is modeled by an advantage factor defined as 
“the compensation of impairment factors when the user benefits from other types 
of access”. However, this is specifically designed for voice communication and 
does not necessarily apply to the same extent to other applications such as 
online video games or non-RTC applications, thus further research is required. 
A more recent example is provided in [40]. Here, the scenario of adaptive video 
streaming on the move is highlighted, especially the challenges surrounding the 
prediction of outage events through context and appropriately modifying the 
buffering behavior to avoid playback stalls. 
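
To make the role of the advantage factor concrete: in the E-model (ITU-T G.107),
the advantage factor A is added to the transmission rating factor,
R = Ro - Is - Id - Ie,eff + A, and R is then mapped to an estimated MOS. The
sketch below implements only that additive step and the R-to-MOS mapping; the
function names are chosen here, and any concrete value passed as the advantage
is a hypothetical input, not a recommendation.

```python
def r_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def rating_with_advantage(r_without_advantage, advantage):
    """Add the advantage factor A to a rating computed without it."""
    return r_without_advantage + advantage
```

Comparing r_to_mos(70.0) with r_to_mos(rating_with_advantage(70.0, 10.0)) shows
how an access advantage (e.g., mobility) compensates part of the impairment in
the voice case. Whether a comparable compensation holds for gaming or non-RTC
applications is exactly the open question raised above.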


Location Context Factors. Usually, location relates specifically to user
location. However, for applications that are hosted remotely in the cloud, it could
be extended to include where the application is hosted, keeping in mind the relative
distance between host and client and the impact on latency. With the growing
importance of cloud services this raises an interesting interdependence of spatial,
physical, and economic context factors and can have a significant impact on the
QoE of RTC applications such as communication and gaming.

A 2011 study [34] suggests a Data Center (DC) energy consumption ratio 
of 1.5% of the global production. This share may continue to rise as public and 
private cloud adoption is growing steadily. About 70% of the power usage is 
directly proportional to the server load, making efficient usage of the existing 
servers a must. The principal underlying technology, which facilitates manage- 
ment of workload in a DC, is virtualization. Rather than each server hosting 
a single application instance, a number of Virtual Machines (VMs) are running 
concurrently on a single physical server. These VMs may be migrated to a differ- 
ent local host or across the Wide Area Network (WAN) depending on a variety 
of strategies with significant impact on QoE. 

A good example is the follow-the-sun strategy that helps minimize the
network latency during office hours by placing VMs close to where they are 
requested most often. Such a strategy can improve QoE, with the trade-off of 
increased energy costs as they are typically higher during daylight time. Where 
latency is not a primary concern or where other factors are given precedence, 
there are a number of different strategies which can be applied in addition. These 
generally involve VMs getting shifted to locations with cheaper resource costs, 
for example to places where less expensive cooling is required, exploiting lower 
power costs during night times (“follow-the-moon”), or following fluctuating 
electricity prices on the open market [46]. 
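
The trade-off between the placement strategies above can be sketched as a simple
selection rule. This is a toy model under stated assumptions: the DataCenter
fields, the single latency bound, and the fallback to the lowest-RTT site are
all hypothetical and stand in for whatever cost and latency information an
operator actually has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    name: str
    rtt_ms: float        # round-trip time to the main user population
    energy_price: float  # current energy price (arbitrary unit per kWh)


def choose_data_center(candidates, max_rtt_ms=None):
    """Pick a VM location (illustrative sketch).

    Without a latency bound, take the cheapest site ("follow-the-moon"
    style). With a bound, take the cheapest site that meets it
    ("follow-the-sun" style); if none does, fall back to the lowest RTT.
    """
    if not candidates:
        raise ValueError("no candidate data centers")
    if max_rtt_ms is None:
        return min(candidates, key=lambda dc: dc.energy_price)
    eligible = [dc for dc in candidates if dc.rtt_ms <= max_rtt_ms]
    if not eligible:
        return min(candidates, key=lambda dc: dc.rtt_ms)
    return min(eligible, key=lambda dc: dc.energy_price)
```

Tightening max_rtt_ms moves the choice toward nearby but more expensive sites,
which mirrors the QoE versus energy cost trade-off described in the text.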

A final reason for monitoring spatial context factors in DC environments 
is that of data safety. Operations related to fault tolerance, mirroring, mainte- 
nance, or disaster recovery can be the cause of VM migrations. Regardless of the 
motivation, migrating VMs can greatly impact application response times and 
thus QoE. If detailed Service Level Agreements (SLAs) are negotiated and in 
place, QoS parameters can be contractually covered. However, users without an 
SLA may then suddenly find greatly increased delays during or after a migration. 


3.2 Use Case: Video Flash Crowd and QoE 


The potential of context monitoring for video streaming was assessed, e.g., in 
simulation studies in [22,25] on which this section is based. In the papers’ sce- 
nario, a flash crowd of users watching the same video is examined. This sudden 
increase of popularity is a typical phenomenon of video platforms with user- 
generated content like YouTube. Video cascades can often emerge due to new or 
popular content being spread through social media challenges. Those phenomena,
dubbed flash crowds, may be temporally limited because of event-related
content, spatially limited because of regional interests, or socially limited due
to grouping effects in social interests [12]. The simulation provides a model
for the effects of flash crowds on adaptive streaming and compares different 
approaches to Content Distribution Network (CDN) load balancing and video 
adaptation strategies to each other to study the benefits of context information
in the approaches that support it. 

The context-unaware strategy has no knowledge of flash crowds and reacts 
too slowly to such events. Hence, some users experience stalling even though
enough capacity would be available to serve all users and stalling should be 
entirely avoidable. HTTP adaptive streaming (HAS) is necessary in order to 
adapt to the current network situation and to reduce the number of stalling 
events. If many of the users unaware of the flash crowd event request the video 
at its highest quality, most of the users will suffer from stalling, leading to a 
worse overall QoE. This topic is also covered under the term QoE fairness [26]. 

The utilization of context information by the CDN as well as by HAS 
improves QoE. It is difficult to properly configure the context-unaware load bal- 
ancing strategies as the exact values strongly depend on the actual flash crowd. 
Therefore, a proper information exchange mechanism is required to make the 
information available across layers. The earlier the flash crowd scenario is rec- 
ognized by the load balancer the better the overall system performance will be. 
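
How flash crowd context could inform quality selection can be sketched as
follows. This is not the simulation model of [22,25]; it assumes, purely for
illustration, that the expected number of concurrent viewers and the available
serving capacity are known, and it caps every client at the highest bitrate
that all viewers could receive at once.

```python
def fair_bitrate_kbps(ladder_kbps, capacity_kbps, expected_viewers):
    """Highest ladder rung that all expected viewers can be served with.

    Hypothetical helper: returns the lowest rung if even that exceeds the
    per-viewer share, in which case stalling may be unavoidable.
    """
    if not ladder_kbps:
        raise ValueError("bitrate ladder must not be empty")
    if expected_viewers <= 0:
        # No crowd expected, so no cap is applied.
        return max(ladder_kbps)
    share_kbps = capacity_kbps / expected_viewers
    feasible = [b for b in ladder_kbps if b <= share_kbps]
    return max(feasible) if feasible else min(ladder_kbps)
```

A context-unaware client would start at the top of the ladder regardless of the
crowd size; with the cap, the quality is lowered before the capacity is
exhausted rather than after stalling has already occurred.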

The results are very sensitive to the dynamics and interactions of the HAS 
control loop and the CDN load balancing. Thus, in practice, realistic tests and 
input models are required to quantify the results and to derive reasonable config- 
urations. This serves to demonstrate the potential of exploiting relevant context 
data. In the given use case, the contextual information regarding the formation 
of a flash crowd is collected by a third party. Other studies have also addressed 
related improvements, such as the approach proposed in [37], which suggests a 
video control plane that can dynamically adapt both the CDN allocation and 
video bitrate midstream based on global context knowledge of network state, 
distribution of active clients, and CDN performance variability. 

Other video scenarios addressed in related work draw similar conclusions. 
For example, significant work has addressed the challenges arising from multi- 
ple concurrent clients accessing HAS video in a given access network, thereby 
competing for bandwidth across a shared bottleneck link. Problems arise due to 
individual clients making adaptation decisions based on local observations, hence 
clients’ adaptation behaviors interact with each other, which results in quality 
oscillations. Solutions proposed in literature involve centralized network-based 
solutions deployed using SDN [20], enhanced client-side adaptation to improve 
fairness among flows [32], and server-based traffic shaping [2]. However, in all 
these cases context information (in this case network state data including over- 
all resource availability and global resource and traffic demands) could likely be 
utilized to control quality adaptation decisions (on a domain-wide level), conse- 
quently reducing oscillations and improving QoE. 

The flash crowd scenario can illustrate the benefits of employing SDN as a 
centralized technological solution. With traditional IP networks, decisions are 
made based on local knowledge, so even if a server that conducts data mining 
on social networks could foresee a high download rate, it will be challenging to 
perform load balancing. This task would require changing the routing policy at 
the Border Gateway Protocol (BGP) level but may also have an impact on the 
Interior Gateway Protocol (IGP) which is confined to its autonomous system.
Even if the propagation of the routing updates would be implemented by a gossip 
algorithm, the time it takes will make the network topology change irrelevant. 
However, with SDN, this very complex and time consuming task is simplified. 
The forwarding decision is taken higher up by the central controller, so an update 
to the routing decision at the controller implements the load balancing. From 
a technical perspective, further insight is needed with regard to the information
exchange between SDN controllers that are responsible for different domains, and 
between SDN controllers and other entities that can provide context information. 


3.3 Use Case: Online and Cloud Gaming 


Context monitoring can play a huge role for video games, specifically for online 
and cloud games. Compared to videos and watching video streams, video games 
add a high degree of user interactivity to the mix, which limits and alters the 
type of eligible evaluation and management approaches. 

In the case of cloud gaming, the game server executes the game logic, ren- 
ders the scene, and sends an encoded video stream to the client in real-time 
[27,41]. The client is responsible for decoding the video and capturing the player’s 
commands. Two network-level factors are critical: the bandwidth required by the 
video stream (influencing image quality and frame rate) [58], and the Round- 
Trip Time (RTT) (influencing the game’s responsiveness to input commands) 
[13,14]. For online games, the throughput is much less important when compared 
to the RTT. Nonetheless, the network path is influenced by numerous kinds of 
context factors, most prominently technical (type of access), spatial (user on the 
move), and economic (data center location and amount of available processing 
and GPU resources). 

Speaking of the economic factor, service providers are inclined to either use 
a centralized (to achieve a maximum multiplexing gain especially of the costly 
GPU resources) or a follow-the-moon strategy on their active server locations 
to save energy and processing costs. However, choosing a too-distant data-center
location would introduce additional latency to the system, negatively impacting
the player’s experience. The number of concurrent players can be deduced from a
diurnal temporal context factor. Proper trade-offs between these three context
factors (namely economic, spatial, and temporal) have to be determined. 

Besides network aspects, the type of device used for the gaming client and its 
input methods play a significant role in evaluating cloud gaming. The device’s 
native resolution determines the game’s optimal rendering quality setting and 
the capture resolution as well as the required bandwidth to transmit the video 
stream. Likewise, but maybe not as obvious, the available input methods deter- 
mine the effect size latency has on the game. This especially concerns touch 
controls, which typically lack in accuracy as well as immediacy that can be 
especially felt in fast-paced games such as first-person-shooters. 

This argument can additionally be extended to consider the type and genre 
of a game as a further context factor. The range of interactivity in games is very 
large and diverse, sometimes even in a single game. While some games require 
decision-making and inputs on a millisecond-scale (e.g., first person shooters
like Counter-Strike: Global Offensive or DOOM), some games might require 
hundreds of input decisions every minute (StarCraft II), while others require 
only a single input every few seconds or even minutes with very wide timing 
windows (take turn-based strategy games or adventure games with logic puzzles 
such as the classic Monkey Island series). This technical context factor is
directly reflected in the way QoE management can be conducted in online and
cloud games. If the specific characteristics of a game are known in advance, the
timing constraints can be determined without further measurements and put to 
use for management of the transmission. Also, if a game with an online com- 
ponent is played through a cloud gaming service, the game’s lag concealment 
mechanisms (that are in place for almost any kind of online multiplayer game) 
could additionally encompass the cloud gaming service and provide telemetry 
on the streaming latency to the online game’s server in an effort to capture the 
actual end-to-end latency and improve the gaming experience. 
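
The idea of combining the cloud gaming streaming latency with the online game's
own network latency, and checking the sum against a genre-dependent timing
constraint, can be sketched as below. The function names and the per-genre
budgets are hypothetical; the numbers are placeholders for illustration and are
not measured or taken from the literature.

```python
# Placeholder latency budgets per genre (illustrative values only).
EXAMPLE_BUDGETS_MS = {
    "first_person_shooter": 100.0,
    "real_time_strategy": 200.0,
    "turn_based": 1000.0,
}


def end_to_end_latency_ms(streaming_ms, game_server_rtt_ms):
    """Sum the cloud streaming latency and the RTT to the online game server."""
    if streaming_ms < 0 or game_server_rtt_ms < 0:
        raise ValueError("latencies must be non-negative")
    return streaming_ms + game_server_rtt_ms


def latency_headroom_ms(genre, streaming_ms, game_server_rtt_ms, budgets_ms):
    """Positive result: slack remains; negative: the budget is exceeded."""
    total = end_to_end_latency_ms(streaming_ms, game_server_rtt_ms)
    return budgets_ms[genre] - total
```

With known game characteristics, a negative headroom could trigger management
actions (e.g., prioritizing the stream), while a large positive headroom
indicates that resources could be spent elsewhere.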

A further player-level temporal and social context factor might also be inter- 
esting for management purposes: the duration players have been playing the 
same game in one session as well as the total time. Both metrics might either 
heighten the player’s sensitivity to deviations from the expected playing expe- 
rience or might make her more tired and thus ignorant of quality degradations. 
Depending on the direction of the interaction, QoE management can either be
lenient or become more stringent during the course of a game session. Previ- 
ous studies have shown the link between network performance and game play 
duration [11]. Further studies have shown that player experience and skill are 
important QoE influence factors which may clearly impact a player’s tolerance to 
performance degradations [43,59]. Hence, experienced players could be treated 
differently from novice or inexperienced players by the responsible mechanism.


4 Discussion on Technical Realization Approaches 
of Context Monitoring 


Thus far we have discussed a generic framework for context monitoring and 
provided use cases which outline the opportunities for exploiting concrete context 
data in QoE monitoring, estimation, and management solutions. In this section 
we discuss novel technical realization approaches and challenges. 

In recent years, concepts like SDN and Network Functions Virtualization
(NFV) have become key drivers of network innovation. With their emergence,
the importance of software in networking has grown rapidly. This trend is led
by open-source initiatives like the OpenDaylight¹ project, the Open Network
Operating System (ONOS)² project, or the Open Platform for NFV (OPNFV)³
project. The introduction of these technologies paves the way for new possibilities
to control and centrally orchestrate the network in a more flexible fashion in
order to achieve and maintain a high QoE standard for the users. SDN involves
decoupling the data plane from the control plane [39] where the decision on how
to forward a packet is taken.

¹ https://www.opendaylight.org.
² http://onosproject.org.
³ https://www.opnfv.org.

One of the goals of SDN is the creation of a network OS which abstracts the 
network complexity and allows operators to specify a network policy based on 
data analytics and other sources without having to take care of the implemen- 
tation details. Similarly, SDN can enable applications and services to express 
the required QoS expected from the network and let applications on top of the 
controller translate this algorithmically into optimized forwarding decisions in
order to satisfy QoE demands. SDN increases flexibility and efficiency by allow- 
ing the information to be aggregated at a single logically-centralized location 
(global SDN controller) to be subsequently accessed by multiple applications. 
SDN can enable a global or domain-wide view of a network and its usage, which 
includes network-level QoS and topology, user and application QoE, and context. 
Information from multiple sources can enhance the correlation and prediction 
of traffic demand. SDN allows the creation of dedicated monitoring solutions. 
The programmability of these kinds of flow-based monitoring functions sup- 
ports a fine-grained adaptivity (e.g., per user, per application) of monitoring 
and enforcement procedures. These features can enable real-time application- 
aware network resource management. Combined with new distributed cloud 
approaches, like fog or mobile edge computing, this can bring services much 
closer to the user, thereby reducing latency and improving quality. Information 
gained from context-based QoE monitoring can thus not only be used to steer 
traffic, but also to dynamically instantiate edge cloud capabilities. 

It should be noted that SDN is by no means the only solution to context
monitoring, but it might be the most prominent one in the future when it comes
to operator-wide orchestrated monitoring. Other solutions might include
more network-independent, end-to-end, application-layer, as well as
cross-layer-focused approaches, e.g., in the fashion of [3]. While these are out of
scope for this work, they merit their own separate investigations in the future.
Moreover, this section should be understood as a high-level discussion of the
benefits and obstacles of the SDN-based approach and not as an instruction manual
on the actual implementation of context monitoring.


4.1 Monitoring with SDN 


One of the benefits of SDN is that it facilitates data collection through standard- 
ized vendor-independent interfaces and makes the monitoring task more scalable 
by not having to place isolated monitors in multiple locations. Rather, QoS, QoE, 
and context information can be collected at a logically centralized location (for 
example the SDN controller). The interface between the controller and the net- 
work forwarding devices is generally referred to as Southbound Interface (SBI), 
enabling the controller to know what is going on in the network. Several SBI pro- 
tocols have already been proposed (e.g., “ForCES” by IETF) although the most 
famous is OpenFlow (OF). While in principle the OpenFlow protocol enables 
monitoring, its main purpose is the control of traffic flows. Therefore, usually 
dedicated monitoring protocols such as “sFlow” are used to gather information
for the SDN controller. 

At this stage a couple of issues already arise. These are mostly present in 
classic monitoring solutions as well, as long as they act as an intermediary 
and not on an end-to-end basis. First, application QoE measurements and con- 
text information require communication and information exchange between SDN 
controllers and end-user devices or content providers. Monitoring protocols are 
however typically designed to communicate with network elements, but for the 
envisaged scenario, metrics are also needed from user devices, such as terminal 
capabilities (e.g., CPU) or buffer state. Generally, the concept where a global 
SDN controller has relevant information from the full end-to-end infrastructure 
(core/access/end device) is tied to emerging research on time-aware applications 
and systems [61]. This initiative is investigating the opportunities and challenges 
that are presented by having precise time synchronization demands on all inter- 
connected devices. One key benefit of this approach is in the area of QoS/QoE 
whereby the SDN controller will have access to precisely timestamped network 
information. This is especially important for delay-sensitive applications, such 
as multiplayer online games or teleconferencing. 

Context information includes, amongst others, the available Wi-Fi networks, 
user and subscription information, and provider policies. For this reason, inter- 
working and communication of SDN with entities such as 3GPP’s Access Net- 
work Discovery and Selection Function (ANDSF) and Policy and Charging Rules 
Function (PCRF) can be important. The PCRF and the accompanying Policy 
and Charging Enforcement Function (PCEF), which are traffic management and 
DPI nodes located in the core network, have access to all the policy and charg- 
ing information related to subscribers. Having this knowledge, they control the 
provisioning of the QoS for each flow through the concept of bearers (a network 
tunneling concept). These two entities are therefore also a good candidate for 
controlling and enforcing QoE-based policies. A future, enhanced PCRF should 
then be able to enrich the bearer selection and management process based on 
the specific QoE requirements of each service data flow. 

Accurate SDN-based QoE monitoring requires measurements from multiple 
domains, which in turn necessitates communication between SDN controllers 
(via their “westbound” and “eastbound” interfaces). In [54] different hierarchies
and topologies for the deployment of controllers are proposed. The ONOS
project [5] is a good example of efforts to tackle the challenge of providing a 
centralized coordination while avoiding performance degradation at the control 
plane. ONOS maintains a global view of the network while the SBI is physically 
distributed among multiple servers. 


4.2 QoE Management with SDN 


Overview of Existing Approaches. Initial commercial applications that
combine QoE management with SDN already exist (e.g., for mobile networks⁴). This
is reflected in literature as well. In [36] the SDN architecture is leveraged in order
to introduce a “QoE-service” specifically for OTT providers. This QoE-service
takes advantage of network QoS metrics, application QoE metrics or user
feedback, and by combining them with the global resource view of the network it
enables elastic traffic steering with the objective to enhance the performance of
OTT applications or for premium users.

⁴ https://www.nokia.com/en_int/news/releases/2014/09/02/nokia-networks-big-data-innovation-promises-dynamic-experience-management-networksperform.

The work in [45] considers a framework where the QoE of streaming video 
is measured at the video player running at the destination node, informing the 
network of the achieved QoE. If the video QoE is low, the SDN-based network 
identifies bottlenecks on the delivery path and assigns a new video server to 
deliver the requested video to the end-node. A similar approach is taken in 
[30] on the example of YouTube video streaming. An agent at the end devices 
measures the amount of buffered video data of the video and notifies an appli- 
cation controller if it drops below a certain threshold. The application controller 
instructs an SDN controller to temporarily relocate the YouTube flows of this 
user to a less congested link. When the video recovers, the relocation is reversed 
to conserve resources. 
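
The buffer-driven control loop described for [30] can be sketched as a small
state machine. This is a simplified illustration, not the implementation from
[30]: the class name, the returned action strings, and in particular the second
(higher) threshold used to decide when the video has recovered are assumptions
made here to avoid rapid toggling between links.

```python
class BufferAwareSteering:
    """Illustrative application-controller logic driven by buffer reports."""

    def __init__(self, low_s=5.0, high_s=15.0):
        if low_s >= high_s:
            raise ValueError("low_s must be smaller than high_s")
        self.low_s = low_s      # relocate when the buffer drops below this
        self.high_s = high_s    # restore once the buffer has refilled to this
        self.relocated = False

    def on_buffer_report(self, buffered_s):
        """Return 'relocate', 'restore', or None for one buffer report."""
        if not self.relocated and buffered_s < self.low_s:
            self.relocated = True
            return "relocate"   # ask the SDN controller for a less congested link
        if self.relocated and buffered_s >= self.high_s:
            self.relocated = False
            return "restore"    # reverse the relocation to conserve resources
        return None
```

Feeding the reports 20, 4, 8, 16, and 18 seconds of buffered video into one
instance yields a relocation at the second report and a restoration at the
fourth, with no action otherwise.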

The work in [17] proposes an in-network QoE measurement framework. 
Unlike the previous approach, the QoE is measured inside the network and based 
on the video fidelity and representation switching. This is achieved by SDN repli- 
cating the video stream to a QoE measurement agent. The QoE measurements 
are collected by a measurement controller, which can modify the forwarding 
paths using SDN in order to improve the delivered QoE. Moreover, the measure- 
ment controller exposes an API to allow other applications to obtain information 
on the delivered QoE. The work in [16] proposes a QoE management framework 
for SDN-enabled mobile networks and consists of three modules: QoE monitor- 
ing, QoE policy and rules, and QoE enforcement through network management. 


Example Context Monitoring Solution with SDN. Previously discussed 
metrics (network and system-level KPIs, application-level KQIs, context infor- 
mation) may be accumulated at the controller and provide input to the man- 
agement plane. This information can be used by an application on top of the 
Network operating system (NOS) to perform optimization and control. This 
information is forwarded by the controller via the Northbound interface (NBI). 
There are several ways this information can be leveraged to perform efficient 
management of resources. For example, in the context of the video flash crowd 
use case presented in Sect. 3, it could be used at the CDN level to make the
CDN aware of expected congestion and perform proactive rather than reactive 
load balancing. In the latter case this means that another dedicated application 
at the CDN will change the resource allocation through a dedicated controller. 

Assuming some decision is taken, this decision has to be forwarded to the
controller via an NBI, and finally from the controller to the device via the same
SBI that has been used before. As such, both the NBI and SBI are used for
monitoring in the uplink as well as management in the downlink. In Fig. 3 we
illustrate a use case, namely a video service provider communicating with the
NOS (controllers) via NBI which in turn gather the monitored data or provide
the rules (i.e., how to handle the traffic) to the involved virtual and physical
network elements.

Fig. 3. An example of performing context monitoring in an SDN environment, adapted
from [31] and extended to illustrate a context monitoring scenario. Other scenarios
may also utilize APIs bound to different directions, e.g., context monitoring can also
be conducted through the NBI.


5 Conclusions 


This paper has provided detailed insights into how context monitoring can be 
beneficial for QoE management with the goal to improve both QoS and QoE. A 
context information classification scheme has been proposed to serve as a generic 
framework, which has further been applied to different context monitoring use 
cases with various types of services involved. The use case scenarios were selected 
to demonstrate the significance of context factors and the ways in which they 
can interrelate. The video streaming use case illustrates the benefits of having
access to context data prior to and during the formation of a flash crowd, and
how such an event could be combated with CDN load balancing strategies. The
benefits and potential exploitation of context data are further discussed in the
case of online and cloud gaming, representing a highly demanding real-time
interactive scenario. Further use cases have highlighted some opportunities of
how context monitoring can be deployed to improve QoE in future wireless and
mobile networks.

Key issues to address with respect to deploying context-aware QoE moni- 
toring and management solutions are the technical realization challenges. We 
focus our discussions on the SDN paradigm as it can offer a promising technical 
solution. We are also aware of the potential implications to and conflicts with 
user privacy. This topic was specifically omitted here, as it very much merits a 
separate discussion. 

In the growing Internet of Services, QoE management will play an important
role. It can be proliferated, e.g., through SDN, especially when the variety of 
service providers, monitored metrics and SDN controllers is considered. Finally, 
underlying business models will play a key role in putting an effective QoE 
management scheme based on enhanced monitoring into practice. 
Abstract. This chapter discusses prospects of QoE management for
future networks and applications. After motivating QoE management, it 
first provides an introduction to the concept by discussing its origins, key 
terms and giving an overview of the most relevant existing theoretical 
frameworks. Then, recent research on promising technical approaches to 
QoE-driven management that operate across different layers of the net- 
working stack is discussed. Finally, the chapter provides conclusions and 
an outlook on the future of QoE management with a focus on those 
key enablers (including cooperation, business models and key technolo- 
gies) that are essential for ultimately turning QoE-aware network and 
application management into reality. 
1 Introduction to QoE Management 


Understanding, monitoring and managing the provisioning of networked applications 
and services is a domain that receives growing interest from academia and industry. 
This development is mainly a consequence of increasing competition amongst 
stakeholders in the ICT, media, and entertainment markets, the proliferation of 
resource intensive services (such as online video and virtual reality movie stream- 
ing) and the ever-present risk of customer churn caused by inadequate service qual- 
ity. Furthermore, the foreseen paradigm shift towards an Internet of Services (IoS) 
© The Author(s) 2018 
will lead to systems where new applications and services are based on flexibly 
configurable large-scale service chains, which depend on high levels of flexibility, 
quality, and reliability [20]. 

These trends create conflicting demands, particularly on the network operators 
and service providers involved. On the one hand, they need to offer sophisticated 
high-performance infrastructures and services that enable affordable high quality 
experiences that lead to customer satisfaction and loyalty [48]. On the other hand, 
they have to operate on a profitable basis in order to remain economically viable 
in the long run. 

In this context, Network and Application Management (NAM) has the poten- 
tial to resolve this central dilemma by enabling a better match between resource 
supply and demand on the basis of more informed trade-offs between quality, 
performance, and economy [66,73,84] based on validated ground truths. NAM 
is supposed to observe and react quickly to quality problems, at best before 
customers perceive them and decide to churn. It should ensure that sufficient 
quality and performance are provided while constraining the application (and 
its underlying service building blocks) to behave as resource-efficiently as possi- 
ble in order to minimize operational costs. 

Figure 1 provides a high-level overview of managing resources and quality 
in the context of networked multimedia and communication services, where the 
management of networks and applications constitute complementary approaches. 
Network management (NM) focuses on monitoring and controlling the network 
entities of the delivery infrastructure on access, core network, and Internet level. 
The goals of network management typically are efficient resource allocation, 
avoidance of Quality of Service (QoS) problems (like packet loss from congestion), 
and generally keeping the network “up and running” without faults. In contrast, 
application management (AM) aims to adapt quality and performance on the end-user 
as well as the application host/cloud level. 

In most cases nowadays, AM adapts the application to the conditions encountered 
in the network, as it is situated much closer to the user than network-level 
controls. For example, in HTTP Adaptive Streaming (HAS), the quality of the 
media stream (and consequently, its bitrate) is dynamically adapted not only to 
the network bandwidth available on the path between client and server, but also 
to application-layer parameters (like video buffer level) and context (like 
battery status). AM thus often acts as a “mediator” between the network and the 
end user, while taking other aspects (application, user preferences, context) 
into account. While AM is being widely used in today’s consumer Internet 
where traffic is transmitted on a “best effort” basis without taking into account 
the diverse quality requirements of different applications and users, it is only 
when network and application management are being used in conjunction, that 
the full potential of NAM can be reached [66,84]. 
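
The HAS adaptation described above can be illustrated with a minimal sketch. It is not taken from any particular player: the names `ClientState` and `select_bitrate`, the safety margin, and the buffer and battery thresholds are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    """Hypothetical snapshot of what an HAS client can observe."""
    throughput_kbps: float  # estimated network bandwidth
    buffer_s: float         # seconds of video currently buffered
    battery_pct: float      # context: remaining battery in percent


def select_bitrate(ladder_kbps, state, safety=0.8,
                   low_buffer_s=5.0, low_battery_pct=15.0):
    """Pick a bitrate from the ladder using network, application and context data.

    All thresholds are illustrative assumptions, not values from the chapter.
    """
    ladder = sorted(ladder_kbps)
    # Network: only spend a fraction of the estimated throughput.
    budget = state.throughput_kbps * safety
    # Application layer: be more conservative when the buffer runs low.
    if state.buffer_s < low_buffer_s:
        budget *= 0.5
    candidates = [rate for rate in ladder if rate <= budget]
    choice = candidates[-1] if candidates else ladder[0]
    # Context: cap the quality at the middle rung when the battery is low.
    if state.battery_pct < low_battery_pct:
        choice = min(choice, ladder[len(ladder) // 2])
    return choice
```

With a ladder of 300, 750, 1500, 3000 and 6000 kbit/s and an estimated throughput of 4000 kbit/s, the sketch selects 3000 kbit/s for a healthy client and falls back to 1500 kbit/s when either the buffer or the battery is low.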

In addition, there is a growing awareness within the scientific community 
and industry that technology-centric concepts like Quality of Service (QoS) do 
not cover every relevant performance aspect of a given application or service 
(cf. [30,65]), nor do they explain the value that people attribute to it as 
Fig. 1. Overview: Network- and Application Management operate at different control 
points along the delivery chain of services and applications (based on [66]). 


a consequence [21,48]. For these reasons, the concept of Quality of Experience 
(QoE) has gained strong interest, both from academic research and industry 
stakeholders. Being linked very closely to the subjective perception of the end 
user, QoE is supposed to enable a broader, more holistic understanding of impact 
and performance of network communication and content delivery systems and 
thus to complement traditional perspectives on quality and performance. 

While conceptualizations and definitions of QoE have dynamically evolved 
over time (cf. [67]), the most comprehensive and widely used definition of QoE 
today has emerged from the EU Qualinet community (COST Action IC1003: 
European Network on Quality of Experience in Multimedia Systems and Services): 
“Quality of Experience (QoE) is the degree of delight or annoyance of the user of an 
application or service. It results from the fulfillment of his or her expectations with 
respect to the utility and/or enjoyment of the application or service in the light of the 
user’s personality and current state. In the context of communication services, QoE 
is influenced by service, content, device, application, and context of use.” (Qualinet 
White Paper on Definitions of Quality of Experience (2012)) [61]. 

Thus in contrast to QoS, QoE not only depends on the technical perfor- 
mance of the transmission and delivery chain but also on a wide range of other 
factors, including content, application, user expectations and goals, and context 
of use. Understanding QoE thus demands a multi-disciplinary research approach 
that goes beyond the network level. In particular, different applications 
have different QoE requirements (also including different QoS-dependencies), 
necessitating different QoE models, monitoring and eventually, different QoE 
management approaches. For example, while for securing the QoE of online 
video services, media playback quality (high resolution, no stalling, etc.) is of 
prime importance, the situation is different for online cloud gaming, where the 
reactiveness of the system is at least equally relevant. 
For these reasons, the potential of extending QoS-focused traditional NAM 
towards QoE-based NAM¹ seems large. According to existing literature (cf. 
[6,66,70,73,84]), QoE-based NAM is supposed to yield the following main advantages: 


— More efficient and effective utilization of resources (network bandwidth, radio 
resources, CPU, etc.) by performing informed trade-offs (e.g. low latency 
vs. fewer playback interruptions when increasing buffer sizes) and maximizing 
impact of resource allocation 

— Increased satisfaction of users, plus the resulting economic benefits (increased 
loyalty, reduced churn, ability to upsell, etc.) 

— Maximization/balancing of user satisfaction over the whole customer popula- 
tion (QoE Fairness) 

— Ability to quickly detect/anticipate problems that really matter and solve 
them in real-time or even before the customer perceives them 

— Ability to charge for value (i.e. high quality or reliability of the service) that 
has actually been delivered to the customer, enabling new business models 


Given these potential improvements and benefits, the overarching key ques- 
tion of this chapter is: “How can QoE frameworks, tools, and methods be used to 
substantially improve the management of future applications and networks?” To 
this end, this chapter discusses the potential of QoE-driven NAM for future net- 
works in the light of current research. In this context we aim to take into account 
challenges arising from current and future applications (like Virtual Reality, or 
VR, and Augmented Reality, or AR), as well as the ongoing transformation of 
communication networks by emerging technologies such as Network Functions 
Virtualization (NFV), Software-defined Networking (SDN), and Mobile Edge 
Computing (MEC). The chapter first provides an overview of the most relevant 
theoretical frameworks related to QoE management in Sect. 2. Then, selected 
research on promising technical approaches to QoE-driven NAM that operate 
across different layers of the network stack is presented. Finally, the chapter 
provides an outlook on the future of QoE management with a focus on those key 
enabling technologies that will be essential for realizing the vision of truly effective 
QoE-aware network and application management. 


2 Towards a Generic Framework for QoE-Driven 
Network and Application Management 


In this section, we first derive the key components and challenges of QoE man- 
agement by surveying recent literature discussing QoE management and related 
frameworks. Then we present a generic framework for QoE-driven Network and 
Application Management (NAM). 


1 QoE-based NAM refers to QoE management applied to the domains of telecommu- 
nications and multimedia. 
2.1 Key Components and Key Challenges 


Technical Frameworks and Challenges. Schatz et al. [66] provide a comprehensive 
overview of work in this area before 2014, distinguishing between network and 
application management. When it comes to network management, fault and 
performance management represent areas of specific importance. QoE-driven 
network resource management is the second main topic in this context. Several 
works on resource management targeting two different parts of the network 
(access and core network) are discussed, as well as QoE-based network 
management in multi-operator settings. The part on QoE-based application 
management focuses on management schemes explicitly designed for UDP/RTP-based 
and HTTP Adaptive Multimedia Streaming. The authors then demonstrate the 
usefulness of joint network and application management in two very distinct 
application scenarios, i.e. QoE management for managed services and 
Over-The-Top (OTT) video. Finally, the authors state that a key challenge to 
be addressed by the research community relates to clarifying and ensuring a 
common understanding of the meaning of different concepts and notions (like 
quality and performance) as well as highlighting their importance for different 
stakeholders. 

Furthermore, a survey by Barakovic et al. [11] presents an overview, key 
aspects and challenges of QoE management, focusing in particular on the domain 
of wireless networks. The paper addresses three aspects of QoE management: QoE 
modeling, QoE monitoring and measurement, and QoE adaptation and optimization. 
Regarding monitoring and measurement, the authors conclude that different 
actors involved in the service provisioning chain will monitor and measure QoE 
in different ways, focusing on those parameters over which a given actor has 
control (e.g., a network provider will monitor how QoS-related performance 
parameters impact QoE, a device manufacturer will monitor device-related 
performance issues, while application developers will be interested in how the 
service design or usability affects QoE). Moreover, the authors identify the 
following four monitoring challenges, which should be properly addressed by the 
research community: (1) Which data to collect?; (2) Where to collect?; (3) When 
to collect?; and (4) How to collect? The part dedicated to adaptation and 
optimization concludes that in most situations the user-perceived QoE will 
depend on the underlying network performance. However, network-oriented QoE 
optimization processes would clearly benefit from perceived quality feedback 
data collected at the user’s side, since QoE is inherently user-centric. As in 
the previous case, the authors identify four control challenges that arise in 
this context: (1) What to control?; (2) Where to control?; (3) When to control?; 
and (4) How to control? 

On the other hand, when it comes to frameworks, Liotou et al. [53] present a 
conceptual framework toward QoE support, described in terms of functionalities, 
interactions, and design challenges. The framework consists of three main 
building blocks, i.e., a QoE-controller, QoE-monitor, and QoE-manager, all part 
of a “central QoE management entity”. The QoE-controller acts as an interface 
between the central QoE management entity and the underlying network, 
synchronizing communication exchange in both directions. It is in charge of 
configuring the data acquisition process by requesting and collecting feedback 
from appropriate data sources. It also provides the collected data to both the 
QoE-monitor and QoE-manager. Finally, the QoE-controller applies the 
corresponding QoE-aware control decisions back to the network during the final 
step of the QoE management loop. The QoE-monitor is responsible for estimating 
the QoE per flow, that is, per user session, and for reporting this to the 
QoE-manager. The QoE-manager is in charge of conducting any type of customer 
experience management or QoE-aware network management. Regarding the 
QoE-monitor and the challenges in this context, it is of crucial importance to 
select and implement the most suitable QoE estimation model for the application 
scenario of interest, as its accuracy and reliability directly influence the 
precision and reliability of the actions taken by the other building blocks of 
the framework and therefore of the whole QoE management process. For the 
QoE-controller, the challenges become even more complicated. Firstly, the 
selection of appropriate nodes for acquiring QoE-related input is of strategic 
importance. Secondly, choosing the appropriate type of QoE-related input to 
collect represents a further challenge. The authors also discuss some 
realization issues and challenges in the paper, e.g. the physical location and 
type of implementation of a QoE management framework, as well as the power 
requirements for collecting QoE data. Besides the technical challenges listed 
above, an operator interested in implementing this framework has to take into 
account business and legal aspects, which are clearly highlighted and discussed 
in the paper. Finally, the authors showcase the usefulness and efficiency of 
the proposed framework via an LTE case study. 

Another framework, termed an autonomous QoE-driven network management 
framework, was designed by Seppanen et al. [70]. The authors consider 
the proposed framework generic and applicable to a broad range of systems. The 
framework represents a part of a complete customer experience management sys- 
tem. It consists of three layers, i.e. a data acquisition, monitoring, and control layer. 
The data acquisition layer is in charge of collecting all raw data by probes or other 
means of data collection. On the monitoring layer, the raw data produced by the 
data acquisition layer is processed into knowledge about a state of the network, 
which is in turn passed to the control layer. The control layer performs actions 
upon the network based on this knowledge. The authors verify the performance 
and effectiveness of the proposed framework by several tests, where RTP video 
streams were subject to a quality-driven network control. In all the cases, the tests 
result in an improved quality for the relevant clients. More specifically, the tests 
show that it is possible to improve the quality perceived by premium users with- 
out sacrificing the quality of the streams belonging to normal users. Finally, the 
authors claim that the framework is not only able to make a good decision for a 
given time instant, but also to predict the outcome of different decisions and pick 
the best one. In other words, the framework is not only able to improve 
the perceived quality of the selected streams, but also to be conservative with the 
available resources and identify when they are really needed. 
Non-technical Frameworks and Challenges. Both [53,70] highlight the 
benefits of performing QoE-driven and application-aware network management. 
In general, information exchange and cooperation among players involved in service 
delivery has the potential to improve the effectiveness of QoE management 
schemes [35]. In practice, however, the players involved need incentives to engage 
in cooperative efforts (e.g., information exchange, content caching, etc.) due to 
conflicting goals and interests. For example, a cloud provider might aim to max- 
imize end-customer QoE, while a network provider might aim for maximizing 
the efficient use of network resources. Thus, the overall goal of QoE-driven NAM 
depends on the actual stakeholder group(s) involved in its realization. Examples 
for such goals are: maximizing the QoE of a given customer (end user perspec- 
tive), maximizing overall average QoE of multiple customers in a cell/segment 
while maintaining QoE fairness (ISP perspective), or maximizing the number of 
satisfied users while minimizing resource consumption (network/cloud provider 
perspective). 

In this sense, the overall goals of QoE management strongly depend on the 
stakeholders or groups of stakeholders taken into account. As regards the latter, 
the challenge to address in this context is: to what extent can cooperative 
management schemes and underlying business models involving multiple players 
achieve efficient management of network/system resources, while enhancing cus- 
tomer QoE? ISPs employ various traffic engineering mechanisms to keep their 
infrastructures running efficiently. Insight into the network requirements and 
adaptation capabilities of OTT services could aid them in making more efficient 
traffic management decisions. For example, information such as service utility 
functions and service adaptation capabilities could be used to perform cross-layer 
QoE-driven resource allocation among multiple simultaneous and competing ser- 
vice flows [40]. Furthermore, insight into application-level KPIs could aid ISPs 
in identifying user perceived QoE degradations and determining root causes of 
degradations. Given that a large portion of customer complaints aimed at ISPs 
stem from service provider problems rather than network operation problems, 
insights into the root causes of QoE degradations could help ISPs determine 
whether or not resolving a given problem falls within their domain. 
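
To make the idea of utility-based, cross-layer resource allocation more tangible, the following is a minimal sketch under strong assumptions. It is not the algorithm of [40]: each flow is assumed to advertise a short list of operating points (bitrate, utility), and a greedy rule upgrades the flow with the largest utility gain per additional kbit/s. The function name `allocate` and the example values are hypothetical.

```python
def allocate(flows, capacity_kbps):
    """Greedy utility-driven allocation among competing service flows.

    flows: dict mapping a flow name to a list of (rate_kbps, utility) operating
           points with strictly increasing rates; the first entry is the minimum
           operating point of that flow (illustrative assumption).
    Returns a dict mapping each flow name to its allocated rate in kbit/s.
    """
    level = {name: 0 for name in flows}
    used = sum(points[0][0] for points in flows.values())
    if used > capacity_kbps:
        raise ValueError("capacity cannot satisfy the minimum operating points")
    while True:
        best = None
        for name, points in flows.items():
            i = level[name]
            if i + 1 >= len(points):
                continue  # flow already at its highest operating point
            d_rate = points[i + 1][0] - points[i][0]
            d_util = points[i + 1][1] - points[i][1]
            if used + d_rate > capacity_kbps:
                continue  # upgrade does not fit into the remaining capacity
            gain = d_util / d_rate  # marginal utility per additional kbit/s
            if best is None or gain > best[0]:
                best = (gain, name, d_rate)
        if best is None:
            break  # no further upgrade is possible
        _, name, d_rate = best
        level[name] += 1
        used += d_rate
    return {name: flows[name][level[name]][0] for name in flows}
```

For a video flow with operating points (500, 2.0), (1500, 3.5), (3000, 4.2) and a voice flow with (30, 3.0), (64, 4.0) sharing 2000 kbit/s, the sketch first upgrades the voice flow (cheap and valuable), then the video flow to 1500 kbit/s, and stops because the next video upgrade no longer fits.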

Offering application providers access to network-related performance infor- 
mation through APIs could provide the potential for enhanced network-aware 
adaptation decisions (e.g., adapt video streaming quality, or assign end users 
to servers such that end-to-end delays are minimized). Furthermore, insight 
into contextual information such as traffic load patterns can be used by 
service providers for optimizing service delivery. Similarly, offering network 
providers access to application-level requirements could provide the potential 
for application-aware and QoE-driven cross-layer resource management. 

In general, as stated in [27], cooperation opportunities can act as enablers for 
ISPs as well as content delivery infrastructure and service providers to jointly 
launch new applications in a cost effective way. For example, traffic-intensive 
applications such as the delivery of high definition video on-demand, or real- 
time applications such as online games, could benefit from cooperative QoE 
management solutions. 


56 R. Schatz et al. 


From a business oriented point of view, when considering QoE management, 
a key question is how to exploit QoE-related knowledge in terms of increasing 
revenue, preventing customer churn, and ensuring efficient network operations. 
Given a multi-stakeholder environment, business models driven by the previ- 
ously discussed incentives are needed to model the relationships between dif- 
ferent actors involved in the service-delivery chain. Ahmad et al. [6] address 
both the technical aspects and the motivation in terms of revenue generation for 
OTT-ISP collaboration. Their simulation results show that based on a proposed 
collaboration approach, there is a potential for increased revenue for the OTTs 
and the ISPs, stemming from increased customer satisfaction due to improved 
QoE. This work was further extended in [7] covering different perspectives of 
ISP and OTT collaboration in terms of QoE management, i.e., quality deliv- 
ery, technical realizations, and economic incentives. The authors propose and 
evaluate a QoE-aware collaboration approach between OTTs and ISPs based on 
profit maximization by considering the user churn of Most Profitable Customers 
classified in terms of Customer Lifetime Value. 

Consequently, the economic and monetization aspects of QoE are an important 
consideration [77,85]. Different business models that exploit the cooperation 
between ISPs and OTT providers may be foreseen, as summarized by Liotou et al. [52]: 


— token-based models: charge a user according to a certain level of QoS/QoE; 
this may be accompanied with the purchase of a particular application, 

— contract-based models following a tiered approach with different bandwidths 
and quotas, and 

— pay-as-you-go service models, where users are charged for a QoS/QoE level 
in relation to a particular service. 


Summary of Key Challenges. To summarize, the key challenges to be dealt 
with by the research community and practitioners in the near future, coming 
from the works surveyed above, can be divided into three main parts, i.e., challenges 
related to QoE management as a whole (covering high-level conceptual, 
overarching technical and non-technical aspects of QoE management), challenges 
directly related to QoE monitoring, and challenges directly related to 
QoE adaptation and optimization (i.e. control). The challenges are summarized 
in Table 1. 

When it comes to the challenges related to QoE management as a whole, it 
is critical to ensure a common understanding of different concepts and notions 
deployed in this context and to highlight their importance for different com- 
munities involved in the QoE management. Moreover, the physical location of 
a QoE management framework and the type of its implementation represent 
pragmatic challenges in this case. As also previously noted, legal and business 
aspects related to the implementation of a QoE management framework need 
to be addressed. This includes the different optimization goals and interests of 
different stakeholders. In this context, the willingness and (financial) incentives 
of different players involved in service delivery to disclose information to each 
Table 1. Summary of main QoE management challenges 

QoE management as a whole | QoE monitoring | QoE adaptation and optimization 
Different concepts and notions used in QoE management | Type and amount of data to be collected | What to control? 
Physical location of QoE management framework instances; conflicting stakeholder goals and interests | Placement and selection of the collection point(s) | Where to control? 
Type of QoE management framework implementation | Periodicity of data collection and approach used | When to control? 
Legal and business aspects related to practical implementation of QoE management | Power requirements for collecting QoE data | How to control? 


other is a critical obstacle. Even though initial studies show promising results [6], 
until now the actual benefits have not been proven along the whole cost chain. 
Moreover, regulatory restrictions related to the network neutrality principle [33] 
may have a key impact on realizing possible cooperation scenarios linked with 
application-aware traffic management. 

Regarding the technical aspects of QoE monitoring, the following issues are 
open: the type of data to be collected (e.g., related to the service usage and 
configuration, network performance, user preferences, and context of use), the 
placement and selection of the collection point(s), the periodicity of the data 
collection and the approach used, the selection of the most suitable QoE 
estimation model for the application scenario of interest, and the power 
requirements for collecting QoE data. In the case of QoE adaptation and 
optimization, the following four questions should be properly answered by the 
community in the near future [11]: (1) what to control?; (2) where to control?; 
(3) when to control?; and (4) how to control? 

The following section brings together a number of the aforementioned chal- 
lenges by means of a generic framework for QoE-driven Network and Application 
Management. 


2.2 A Generic Framework for QoE-Driven NAM 


Several QoE-driven NAM approaches are investigated in [69]. The presented 
solutions focus on different applications and differ with respect to their specific 
management target. For example, some of the approaches aim at video quality 
fairness among heterogeneous HAS clients [37,43], while other works reduce the 
control delay of Skype with respect to bandwidth variations [83], or reduce video 
stallings for HAS [60]. Based on those existing approaches, the authors define 
monitoring and controlling of QoE indicators on network- and application-level 
as key building blocks for a generic NAM framework. By focusing on the key 
functionalities, the framework can cope with a multitude of NAM approaches, 
despite the diverse objectives and applications that are covered. The presented 
framework is a first step towards addressing the QoE management challenges 
introduced in Subsect. 2.1, as it helps to achieve a common understanding of 
different concepts and can be used to compare different solutions, e.g. with 
respect to design choices like frequency or location of monitoring and control 
functionalities. 

We note that the focus in this section is on the technical realization 
aspects of NAM and not on the business aspects and financial incentives 
of multi-stakeholder cooperation. The presented framework assumes that there 
are underlying business models as well as contractual mechanisms (like SLAs, 
ELAs²) supporting the necessary cooperation between the multiple stakeholders 
involved, such as OTT/service providers and network providers. 


Building Blocks. In order to supervise the state of running applications, Appli- 
cation Monitoring (AppMon) is performed, while Network Monitoring (NetMon) 
keeps track of network-related QoE influence factors (QoE-IFs). The collected 
information is communicated to a centralized instance, e.g. a Policy Manager 
(PM), which has an up-to-date global view of the network and the applications 
running on top of it. Based on its knowledge, the PM is able to compute 
appropriate control actions, which, on the one hand, can be performed on the 
network side. This is denoted as Network Control (NetCon). On the other hand, 
control actions can be performed on application level, forming the Application 
Control (AppCon). Further, a joint optimization of application and network 
might be feasible for several use cases. Repeated monitoring and control of 
application and network then form the control loop of those approaches. This 
is in line with two of the general steps for QoE management, as defined in [11], 
namely (1) QoE monitoring and measurements, and (2) QoE optimization and 
control. 
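
One iteration of this control loop can be sketched as follows. The sketch only illustrates how the building blocks interact; the report types, the decision rule and the thresholds are hypothetical assumptions and are not prescribed by the framework in [69].

```python
from dataclasses import dataclass


@dataclass
class AppReport:
    """Hypothetical AppMon output for one client."""
    client: str
    buffer_s: float  # seconds of media buffered at the client


@dataclass
class NetReport:
    """Hypothetical NetMon output for one link."""
    link: str
    utilization: float  # fraction of link capacity in use, 0..1


def policy_manager(app_reports, net_reports, buffer_low_s=5.0, util_high=0.9):
    """Compute control actions from the global view held by the PM.

    Illustrative rule: a client with a low buffer is asked to reduce its
    bitrate (AppCon) if some link is congested; otherwise its flow is
    prioritized in the network (NetCon).
    """
    congested = [r.link for r in net_reports if r.utilization >= util_high]
    actions = []
    for report in app_reports:
        if report.buffer_s < buffer_low_s:
            if congested:
                actions.append(("AppCon", report.client, "reduce_bitrate"))
            else:
                actions.append(("NetCon", report.client, "prioritize_flow"))
    return actions
```

Calling `policy_manager` repeatedly with fresh AppMon and NetMon reports, and applying the returned actions, would form the monitoring and control loop described above.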

Besides the implementation of NAM building blocks, the abstract framework 
considers three optimization types: 


— Application-level Optimization (ALO) 
— Network-level Optimization (NLO) 
— Policy Manager Optimization (PMO) 


The location at which monitoring information is used to decide on control actions 
determines the optimization type. We briefly describe the optimization types 
and illustrate their employment in NAM approaches by providing examples in 
the following paragraphs. 


Application-level Optimization (ALO). The application collects significant 
information about network or application. Based on this knowledge, the appli- 
cation initiates adaptation or invokes network-level mechanisms. 


2 The acronyms refer to service level agreements (SLAs) and experience level 
agreements (ELAs) respectively, cf. [77]. 
Zhu et al. [83] propose an ALO-based cross-layer framework for OTT services 
like Skype conferencing. Their applied methods are similar to existing 
network layer techniques, such as Explicit Congestion Notification (ECN) and 
Differentiated Services (DiffServ). They use the existing ECN and DiffServ IP 
packet header fields to exchange cross-layer information such as congestion 
and packet priority. Their results show a speed-up of Skype’s response to 
bandwidth variation by indicating congestion immediately. Further, the audio 
packets’ delays can be reduced by intra-flow prioritization. 
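
For readers unfamiliar with these header fields, the following sketch shows how the DiffServ code point (upper six bits) and the ECN field (lower two bits) share one byte of the IP header, as standardized for IPv4 and IPv6. The helper names are hypothetical, and the sketch does not reproduce the specific signalling scheme of Zhu et al. [83].

```python
# ECN code points (two least significant bits of the former IPv4 TOS byte).
ECN_NOT_ECT = 0b00  # sender is not ECN-capable
ECN_ECT1 = 0b01     # ECN-capable transport, code point ECT(1)
ECN_ECT0 = 0b10     # ECN-capable transport, code point ECT(0)
ECN_CE = 0b11       # congestion experienced

DSCP_EF = 46        # Expedited Forwarding, commonly used for voice traffic


def pack_tos(dscp, ecn):
    """Combine a 6-bit DSCP value and a 2-bit ECN value into one byte."""
    if not 0 <= dscp < 64 or not 0 <= ecn < 4:
        raise ValueError("dscp must fit in 6 bits and ecn in 2 bits")
    return (dscp << 2) | ecn


def unpack_tos(tos):
    """Split the byte back into (dscp, ecn)."""
    return tos >> 2, tos & 0b11


def mark_congestion(tos):
    """Set the CE code point on ECN-capable packets; leave others untouched."""
    dscp, ecn = unpack_tos(tos)
    if ecn == ECN_NOT_ECT:
        return tos  # a real router would have to drop instead of mark
    return pack_tos(dscp, ECN_CE)
```

A prioritized, ECN-capable audio packet would thus carry `pack_tos(DSCP_EF, ECN_ECT0)`, and a congested node could signal congestion to the receiver by applying `mark_congestion` instead of dropping the packet.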

Adzic et al. [5] use a content aware method to determine whether the video 
quality gain (expressed in SSIM) when switching to a higher level justifies the 
additional bandwidth consumption. The researchers assume DASH streaming in 
a mobile environment, where high-speed bandwidth volume is mostly limited. 
DASH servers provide videos encoded at constant bitrate (CBR), which does not 
consider the video content. Thus, an increase in bitrate does not necessarily mean 
a remarkable increase in SSIM for each video sequence. This ALO-mechanism 
prevents the client from selecting a higher bitrate when video quality cannot be 
increased significantly. 
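
The underlying decision can be sketched as a simple gate on the quality gain per additional bitrate. This is an illustration only, not the method of Adzic et al. [5]: the function name `next_level`, the per-segment SSIM values and the threshold are assumptions, and the sketch deliberately ignores whether the available bandwidth would permit the switch.

```python
def next_level(levels, current, min_gain_per_mbps=0.01):
    """Decide whether switching one level up is worth the extra bandwidth.

    levels:  list of (bitrate_kbps, ssim) for the upcoming segment, sorted by
             ascending bitrate (SSIM values assumed to be known in advance).
    current: index of the currently selected level.
    Returns the index of the level to request next.
    """
    if current + 1 >= len(levels):
        return current  # already at the highest level
    rate, ssim = levels[current]
    up_rate, up_ssim = levels[current + 1]
    # SSIM gain per additional Mbit/s spent on the higher representation.
    gain_per_mbps = (up_ssim - ssim) / ((up_rate - rate) / 1000.0)
    return current + 1 if gain_per_mbps >= min_gain_per_mbps else current
```

For a segment with levels (1000, 0.90), (2000, 0.95) and (4000, 0.955), the sketch switches from the first to the second level, but stays at the second level because doubling the bitrate again would add only 0.005 in SSIM.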


Network-level Optimization (NLO). The network collects significant information 
about network or application. Using this information, the network parameters 
are adapted or instructions for adaptation are given to the application. 

NLO-centric mechanisms are discussed in the works of Wamser et al. [41,79,80]. 
All of them implement an estimator of the YouTube client’s buffer state by deep 
packet inspection performed in the network. Based on this information, different 
actions are performed in the network. The software-defined networking (SDN) 
approach [41] proposes dynamic re-routing of traffic, whereas in [80] resources are 
flexibly aggregated from one or more access networks. A home network scenario 
was investigated in [79], where YouTube flows are dynamically prioritized as a 
network adaptation. The employment of these mechanisms supports clients that are at risk 
of an empty buffer. The methods applied in the network lead to a fast buffer re-fill 
of those clients and the video stream’s smoothness can be enhanced. 
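
A strongly simplified view of such a network-side buffer estimator is sketched below. It is not the estimator of [41,79,80]: it assumes that the monitoring point can observe the playtime of each delivered segment, that playback starts when the first segment arrives, and that playback never stalls. All names and the threshold are hypothetical.

```python
class BufferEstimator:
    """Estimate a client's buffered playtime from segments seen in the network."""

    def __init__(self):
        self.downloaded_s = 0.0  # total playtime delivered so far
        self.start_time = None   # assumed playback start (first segment seen)

    def on_segment(self, now, duration_s):
        """Record a delivered segment carrying duration_s seconds of video."""
        if self.start_time is None:
            self.start_time = now
        self.downloaded_s += duration_s

    def buffered_s(self, now):
        """Buffered playtime = delivered playtime minus elapsed playback time."""
        if self.start_time is None:
            return 0.0
        played_s = now - self.start_time
        return max(0.0, self.downloaded_s - played_s)


def flows_to_prioritize(estimators, now, threshold_s=5.0):
    """Return the flow ids whose estimated buffer is below the threshold."""
    return sorted(flow_id for flow_id, est in estimators.items()
                  if est.buffered_s(now) < threshold_s)
```

The returned flow ids could then be handed to whichever network mechanism is available, e.g. re-routing, resource aggregation, or prioritization as in the approaches above.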


Policy Manager Optimization (PMO). A centralized instance (PM) has
knowledge about both network and application state. Based on its global system
view, it can orchestrate control actions for applications or the network.

A PMO approach called NOVA, short for Network Optimization for Video
Adaptation, is developed by Joseph et al. [43]. The network regularly sends state
updates to a so-called base station. The client sends a signal to the base station
if the video is at risk of a re-buffering event. An algorithm implemented
in the base station computes the necessary bandwidth slices for each DASH
client so that several QoE-IFs are optimized. The network controller performs
the bandwidth slice allocation, while the rate adaptation is performed by the
clients independently.

The model for NAM approaches, including the different building blocks,
optimization types, and monitoring/control information flow, is illustrated in
Fig. 2.

Fig. 2. Abstract NAM model with its building blocks and optimization types. Blue
arrows indicate monitoring information, yellow arrows represent control information.
(Color figure online)


It is very generic and omits details on the specific realization of the building
blocks (e.g., distributed vs. centralized realization) and of the monitoring
layer (e.g., packet inspection vs. flow-based monitoring in the network).
Frequency, location, and accuracy of monitoring determine how fine-grained
control actions can be and influence the potential for optimizing QoE. For
instance, large time intervals between bandwidth probes permit only a rough
estimation that cannot capture short-term fluctuations. In turn, this leads to
periods in which resources are under- or overestimated, resulting in non-optimal
QoE because the system cannot adapt appropriately to current network
conditions. Nevertheless, the authors intend this generality in order to allow a
simple classification of NAM solutions with respect to monitoring and control
capabilities. Based on the proposed framework, we classify several QoE-driven
NAM approaches (Table 2). Besides building blocks and optimization type, we also
provide the monitored QoE-IFs and the considered applications of the presented
NAM approaches.
As video streaming represents the majority of today’s Internet traffic [23], it is
largely discussed in current research. Accordingly, video streaming is the
dominant application among the approaches presented in the table. In particular,
HAS is considered prevalently. To be able to adapt video quality, it already
implements a control loop that monitors the network throughput or the client’s
buffer filling level. However, there is a variety of applications running on top
of future networks, e.g., VR applications or 3D and 360° video streaming. The QoE
requirements of those applications need to be evaluated in order to facilitate
QoE optimization. The QoE management approach for cloud gaming proposed in [32]
can be seen as one representative in that direction. Furthermore, it
shows that the generic functions of the NAM model are also suitable for
applications other than video streaming. The classification of NAM approaches
reveals that the solutions differ with respect to capability, location, and
frequency of control and monitoring functionalities, and that various QoE
indicators are considered. This highlights that the challenges concerning QoE
monitoring and QoE adaptation still need to be discussed by the community, as
outlined in Sect. 2.1. In order to compare different NAM solutions and to
investigate the impact of different monitoring and control capabilities on
QoE-IFs, the authors set up a measurement environment that implements the key
building blocks and facilitates the interaction between the involved entities.
Initial testbed-driven results and a quantitative analysis of two NAM approaches
are presented in [68].

Table 2. Classification of different NAM approaches w.r.t. optimization type and
frequency of control and monitoring actions. (I = Initial, T = Triggered, P = Periodical,
ALO = Application-level optimization, NLO = Network-level optimization, PMO =
Policy Manager optimization)

Ref. | Opt. type | Net-Mon / Net-Con / App-Mon / App-Con | Monitored QoE-IFs | Considered applications
[83] | ALO | P / T / P / P | Media encoding bitrate, network congestion | Skype
[54] | ALO | P / P / P / P | Media encoding and encoding bitrate, network bandwidth | HAS
[5] | ALO | - / - / P / P | Media encoding and spatial/temporal characteristics, network bandwidth | HAS
[79] | NLO | P / T / P / - | Video buffer, network bandwidth | YouTube
[41] | NLO | P / P / P / - | Packets in the network (DPI), network bandwidth | YouTube, HAS
[24] | NLO | I / P / I / P | Active DASH streams, network resources, client properties, network bandwidth | HAS
[25] | NLO | P / T / I / - | Packet loss, transmission delay | IPTV, Audio
[57,58] | NLO | P / P / I / - | Not specified | Mobile applications
[22] | NLO | P / - / P / P | Encoding bitrate, user subscription, operator cost, network bandwidth | HAS
[37] | NLO | P / T / - / - | Network bandwidth | HAS
[80] | NLO | P / P / P / - | Video buffer, network bandwidth | YouTube
[29] | PMO | P / I / P | Media encoding and encoding bitrate, device resolution, network bandwidth | HAS
[32] | PMO | P / - / P / T | Available bandwidth, active gamers, client setup information, available games | Cloud gaming
[60] | PMO | P / T / P / P | Media encoding and encoding bitrate, video buffer, network bandwidth | HAS
[56] | PMO | P / T / P / - | Video encoding and encoding bitrate, buffering status, network throughput, packet loss | HAS
[43] | PMO | P / P / T / - | Media encoding and encoding bitrate, video buffer, network bandwidth | HAS
[19] | PMO | P / T / P / P | Required throughput per client, available bandwidth, network latency, end device properties, video buffer, video quality | HAS
[63] | PMO | P / P / P / P | Metadata of video content, video buffer, device resolution | HAS
[44] | PMO | P / T / I / T | User preferences, device capabilities, service features, network resource availability | Audio/video call
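
The remark that large intervals between bandwidth probes only permit a rough estimation can be made concrete with a small sketch. The trace values, the sample-and-hold estimator, and the function names below are assumptions chosen for illustration; they are not taken from the measurement environment of [68].

```python
def estimate_from_probes(bandwidth_trace, probe_interval):
    """Hold the most recently probed value until the next probe is taken."""
    estimate = []
    last_probe = bandwidth_trace[0]
    for t, actual in enumerate(bandwidth_trace):
        if t % probe_interval == 0:
            last_probe = actual  # a probe is taken at this time step
        estimate.append(last_probe)
    return estimate


def mean_abs_error(trace, estimate):
    """Average absolute difference between actual and estimated bandwidth."""
    return sum(abs(a - e) for a, e in zip(trace, estimate)) / len(trace)


# Hypothetical trace in Mbit/s, one sample per second, with short-term drops.
trace = [10, 10, 4, 4, 10, 10, 4, 4, 10, 10, 4, 4]

fine = estimate_from_probes(trace, probe_interval=1)    # probe every second
coarse = estimate_from_probes(trace, probe_interval=4)  # probe every 4 seconds

fine_error = mean_abs_error(trace, fine)
coarse_error = mean_abs_error(trace, coarse)
```

In this constructed trace the coarse probing always samples a high-bandwidth instant, so the estimate stays at 10 Mbit/s and every drop to 4 Mbit/s is missed, which corresponds to the periods of overestimated resources mentioned above.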


3 Specific QoE Management Approaches 


This section presents the results of selected QoE management related research 
conducted in the context of COST Action IC1304, serving as examples illus- 
trating how some of the aforementioned challenges related to QoE-driven NAM 
can be effectively addressed. To this end, we present work on multidimensional 
QoE modeling, QoE management by differentiated handling of signaling traffic 
as well as QoE management with SDN. 


3.1 Multidimensional Modeling as a Prerequisite for Effective 
QoE Management 


What is often neglected in the process of QoE management (described in Sect. 2)
is that an essential prerequisite for success is a deep and comprehensive
understanding of the influence factors and multiple dimensions of human quality
perception and how they may impact QoE in future networks and services, given
that humans are quality meters [49]. With the advent of 5G, the range of offered
services, application domains, and contexts in which they will be used will
increase significantly. Consequently, factors such as the explosive growth of
data traffic volume, the number of connected devices, the continuous emergence
of new services and applications, etc., will all contribute to the complexity of
managing QoE [82]. The challenge is even greater given that all of the
above-mentioned targets should be addressed in a multidimensional fashion
(multiple factors and features) and from the point of view of various actors in
the service provisioning chain. In this respect, multidimensional QoE modeling
aims to quantify the relationship between different measurable QoE influence
factors, quantifiable QoE features (or dimensions), and QoE for a given service
provided in a future environment (as given in Fig. 3).

Fig. 3. Multidimensional modelling of QoE. (Diagram labels: QoE Multidimensional
Model; QoE Perceptual Features; Context, Network, and Application Influence
Factors; better understanding of features’ and factors’ impacts.)


Approaches and Results. In the search for multidimensional modeling
approaches to QoE for multimedia services, one may note that most
studies addressing the different factors and features that impact and describe
the user experience or its quality (including studies of user satisfaction or
preferences) focus on a limited set of factors and features. Hence, they offer an
incomplete view of the user experience and QoE. For example, as summarized
in [75], there are models for file transfer [74], Voice over IP [38], video
streaming [34,36,45,71], online video [39], etc., which are based on weighted
impacts of system influence factors (that is, Quality of Service technical
parameters). What is generally missing is a multidimensional approach to QoE
modeling, i.e., the quantification and deeper understanding of multiple
influence factors affecting QoE and features describing it, together with their
mutual interplay [12,13].

Following this idea, the authors of [72] give a generic framework for QoE in a
multidimensional fashion by offering the ARCU (Application-Resource-Context-User)
model, which categorizes influence factors into four multidimensional spaces
and further maps points from these spaces to a multidimensional QoE space,
representing both qualitative and quantitative QoE metrics. What this study
lacks is a concrete implementation for given multidimensional services and,
consequently, results.

However, the studies that have started to address QoE modeling as an important
part of QoE management in a multidimensional fashion were conducted in a Web
environment. In a stationary/desktop Web context, the authors of [76] used a
multidimensional approach to investigate Web QoE by evaluating three
key dimensions that contribute to overall Web QoE: perceived performance,
aesthetics, and ease-of-use. Key results have shown that page loading time and
visual appeal have a significant effect on overall user QoE and that both higher
perceived aesthetics and ease-of-use result in an increased user tolerance to
delay. The research also showed that there is a strong correlation between
overall QoE and perceived aesthetics, ease-of-use, and network performance.

In the mobile Web browsing context (browsing information, thematic, and e-mail
portals via both a smartphone and a tablet), the authors of [12,13] have proposed
multidimensional models that represent and quantify the mutual relations of QoE
and its key features, i.e., perceived Web site loading time, perceived
aesthetics of the Web site, perceived usability of the Web site, and perceived
quality of Web site information, as well as the relations between key IFs and
QoE features. These studies follow the principle given in Fig. 3, and their
contribution is three-fold. Firstly, QoE in a mobile Web browsing context is
addressed as a multidimensional concept. Secondly, the authors have shown that
page loading time, aesthetics, usability, and quality of information provided
by Web sites have an impact on mobile Web QoE. Finally, the mutual relations
between QoE and its features, as well as between QoE IFs and features, are
quantified, and based on the obtained models, one is able to identify the
importance (impact degree) of distinct dimensions in terms of the considered
perceptions and overall QoE.

According to [12], the perceptions of Web site usability, aesthetics, loading
time, and quality of information impact the overall QoE to different degrees,
in that order from most to least influential, regardless of the performed task
or the device used in a mobile Web browsing context. In other words, the
multidimensional models for mobile Web browsing QoE show that the most important
perceptual dimensions are perceived Web site usability and aesthetics,
respectively, and that they impact QoE in a mobile environment more than the
perception of Web site loading time, which was previously found to be the most
influential in a desktop environment.

The extension study given in [13] orders the factors that impact each QoE
feature, again from most to least influential. The perception of Web site
loading time is impacted by the Web site loading time and the number of taps,
in that order, in all considered cases (information, thematic, and e-mail
portal). Perceived usability is impacted by the number of taps, the aesthetics
of the Web site, the Web site loading time, and the quality of Web site
information, in that order, in all considered cases except when browsing the
thematic portal (regardless of the device used). When browsing the thematic
portal via a mobile device, Web site loading time and quality of Web site
information switch places, i.e., the resulting order of impacts is: number of
taps, aesthetics of the Web site, quality of Web site information, and Web site
loading time. The perception of aesthetics is impacted by the aesthetics of the
Web site, the number of taps, and the quality of Web site information, in that
order, in all considered cases except when browsing the thematic portal via a
smartphone, where the number of taps and the quality of information switch
places. Finally, the perception of the quality of Web site information is
impacted by the quality of Web site information, the aesthetics of the Web site,
and the number of taps needed to reach the desired Web content, in that order,
in all considered cases.
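
One simple way to obtain such an impact ordering from subjective ratings is to rank the QoE features by the strength of their correlation with the overall QoE rating. The sketch below is only an illustration of this idea: the ratings are synthetic, the weights in `ILLUSTRATIVE_WEIGHTS` are invented so that the ordering is known in advance, and the modeling technique actually used in [12,13] may differ.

```python
from itertools import product
from math import sqrt

FEATURES = ["usability", "aesthetics", "loading_time", "information_quality"]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_features(ratings, overall):
    """Rank features by the absolute correlation of their rating with QoE."""
    scores = {
        name: pearson([row[i] for row in ratings], overall)
        for i, name in enumerate(FEATURES)
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Synthetic data: every combination of feature ratings 1, 3, and 5, with an
# overall rating built from invented weights (not results of any study).
ILLUSTRATIVE_WEIGHTS = (0.5, 0.3, 0.15, 0.05)
ratings = list(product((1, 3, 5), repeat=len(FEATURES)))
overall = [sum(w * r for w, r in zip(ILLUSTRATIVE_WEIGHTS, row)) for row in ratings]

ranking = rank_features(ratings, overall)
```

Because the synthetic ratings form a balanced grid, the ranking reproduces the order of the invented weights. With real survey data the features are typically correlated with each other, so a regression-based model would be the more appropriate tool.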


Conclusion. It is clear that not all factors can be addressed together in a
single study. Therefore, the focus should be on exploring, in a multidimensional
fashion, the impact of a chosen key set of influence factors and their
perceptions (QoE features) on the user rating of overall perceived QoE for a
given multimedia service in a future network environment [14]. Based on that,
one would be able to identify the importance of distinct dimensions in terms of
overall user-perceived QoE and consequently contribute to better QoE management,
which represents the ultimate goal.


3.2 QoE Management by Differentiated Handling of Session-Control 
Signaling 


As previously discussed in Sect. 2.2, the applications are responsible for
collecting important information about the network or the applications running
on top of it, which is used for application-level adaptation or the invocation
of network-level mechanisms. These tasks concern application-level signaling,
which is the main source of network intelligence, analysis, and user experience
monitoring. The cooperation between the application- and network-level
mechanisms may be realized by using different application-level signaling
protocols, such as the Session Initiation Protocol (SIP) or the Hypertext
Transfer Protocol (HTTP) [15]. While new versions of HTTP (i.e., HTTP/2) and
HTTP alternatives (e.g., the Stream Control Transmission Protocol (SCTP) or
Quick UDP Internet Connections (QUIC)) are the dominant signaling protocols in
the Internet domain, SIP is more widely used in the telecoms domain in the
context of real-time communication services. The increasing usage of these
services requires the real-time processing of a growing amount of SIP signaling.
In order to cope with the explosion of SIP signaling, a mechanism for
differentiated handling of SIP messages is needed to increase the service
quality while decreasing the load on session-control resources [15].

In order to ensure high availability and reliability of SIP servers, different
overload protection mechanisms have been previously discussed in the signaling
performance context [31]. Many research activities have addressed SIP overload
control by considering various aspects such as call rejection [10,51,81],
session awareness [42], and response time [55]. Moreover, the increasing
usage of SIP signaling has resulted in the need for a methodology for measuring
SIP server performance [78]. Different SIP performance metrics have been
evaluated for that purpose in various environments, such as the Internet
Protocol (IP) multimedia subsystem (IMS) [16], the Asterisk IP private branch
exchange (PBX) [47], long term evolution mission critical systems (LTE-MCS)
[8,9], and content-aware networks (CAN) and content-centric networks (CCN) [62].
Considering the related work, it can be noticed that most of the research
activities have focused on analyzing the impact of session-control signaling on
Quality of Service (QoS). However, acceptable QoS does not guarantee that the
end user will experience acceptable QoE.


Approaches and Results. In this regard, an algorithm for SIP message
classification and prioritization [18] (which is implemented in an NS-2
simulation environment [17] and on a Kamailio SIP server) has been investigated
in terms of its QoE impact. Serving as a mechanism for QoE management at the
application layer, as shown in Fig. 4, this algorithm allows the optimization of
SIP signaling procedures, especially under high-load or overload conditions, by
improving SIP performance metrics, i.e., registration request delay (RRD),
session request delay (SRD), and session disconnection delay (SDD). This is a
consequence of the preferential handling of SIP messages used for session
termination over SIP messages used for session establishment, which allows a
faster release of allocated resources and thereby improves the billing-related
user experience. Moreover, this prevents the setup of new sessions under
overload conditions until sufficient resources are available. This may lead to
an improved user experience, since users do not accept service degradation or
interruption once they have started a session. They would rather have the
session blocked whenever the resources are not able to carry it with the
appropriate quality.
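
The preferential handling of session-termination messages can be sketched with a priority queue keyed by SIP method. This is a minimal illustration and not the algorithm of [18]: the class `SipMessageScheduler`, the priority values in `PRIORITY_BY_METHOD`, and the reduction of a SIP message to a method name and a call identifier are assumptions. Only the SIP method names (BYE, INVITE, REGISTER) are taken from the protocol itself.

```python
import heapq
from itertools import count

# Assumed mapping: a lower value is served first. BYE terminates a session and
# is therefore preferred over messages that establish or register sessions.
PRIORITY_BY_METHOD = {"BYE": 0, "INVITE": 1, "REGISTER": 1}
DEFAULT_PRIORITY = 1


class SipMessageScheduler:
    """Serve queued SIP requests by priority, first-come first-served within one."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def enqueue(self, method, call_id):
        method = method.upper()
        priority = PRIORITY_BY_METHOD.get(method, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._arrival), method, call_id))

    def dequeue(self):
        """Return the next (method, call_id) to process, or None if idle."""
        if not self._heap:
            return None
        _, _, method, call_id = heapq.heappop(self._heap)
        return method, call_id


scheduler = SipMessageScheduler()
for method, call_id in [
    ("INVITE", "a"),
    ("REGISTER", "b"),
    ("BYE", "c"),
    ("INVITE", "d"),
    ("BYE", "e"),
]:
    scheduler.enqueue(method, call_id)

served = [scheduler.dequeue() for _ in range(5)]
```

In this example both BYE requests are served before any INVITE or REGISTER, so resources held by terminating sessions are released before new sessions are set up.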

On the basis of the foregoing considerations, there was a need to analyze
the interaction between session-control signaling, QoE, and user perceptions
of signaling performance metrics. With the aim of verifying the proposed
algorithm for SIP message classification and prioritization in a user-oriented
context, a research study has been conducted in order to obtain data for
explaining the overall user satisfaction and the satisfaction with SIP signaling
procedures, i.e., registration, session establishment, and session termination
procedures, under different

Fig. 4. QoE management by differentiated handling of session-control signaling.
(Diagram labels: Application Layer; Application Server; Session Control
Processes; High Priority Queue; Classifier based on SIP message type; Access
Network.)




Fig. 5. The dependence of user perception of overall QoE in terms of MOS on the
signaling load and algorithm implementation: (a) Average MOS; (b) Standard deviation
of MOS. (Bar charts comparing “Without algorithm” and “With algorithm” at the
signaling load dependent measurement points I to IX.)


load conditions. It has been found that session-control signaling plays a part in
affecting the user QoE with Voice over Internet Protocol (VoIP) services. More
precisely, a strong negative impact of session-control signaling load on the user
perception of SIP performance metrics (i.e., RRD, SRD, SDD) and overall QoE
has been determined. Fig. 5 shows the user perception of overall QoE, expressed
in terms of mean opinion score (MOS), as a function of the session-control
signaling load and algorithm implementation. Furthermore, a linear model is
proposed to describe the relation between the user perception of SIP performance
metrics and overall QoE with VoIP.
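
The two quantities plotted in Fig. 5, the average MOS and its standard deviation per measurement point, can be computed from individual user ratings as sketched below. The ratings are invented for illustration and are not data from the study; the function name `mos_summary` and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def mos_summary(ratings_by_point):
    """Map each measurement point to (mean opinion score, sample std. deviation)."""
    return {
        point: (mean(ratings), stdev(ratings))
        for point, ratings in ratings_by_point.items()
    }


# Hypothetical ratings on a 5-point scale for a low (I) and a high (IX)
# signaling load measurement point.
ratings = {
    "I": [5, 4, 5, 4],
    "IX": [2, 1, 2, 3],
}
summary = mos_summary(ratings)
```

For these invented ratings the MOS drops from 4.5 at point I to 2.0 at point IX, which mirrors the direction of the reported negative impact of signaling load without reproducing its magnitude.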

In addition, it has been shown that the proposed algorithm for SIP message
classification and prioritization affects the user perception of SIP performance
metrics and overall QoE with VoIP by decreasing the strong impact of
session-control signaling load on the considered SIP performance metrics.
Therefore, an additional model has been provided to evaluate the mutual
relations between the user perception of distinct SIP performance metrics and
QoE. This has allowed us to determine the importance of the various SIP
signaling metrics, listed here from most to least influential: user perception
of SRD, user perception of SDD, and user perception of RRD.

Moreover, since the algorithm for SIP message classification and prioritization
may be used for service differentiation, it has been investigated whether
differentiated handling of SIP messages affects the quality of unified
communication (UC) service components (i.e., QoS for voice/video calls and
instant messaging (IM)/presence status). Preliminary findings indicate that
there is no statistically significant impact of SIP message differentiation on
the QoS of voice/video calls and IM/presence status. Future work will address
the impact of the differentiation of UC service components on QoE in different
contexts.


Conclusion. Although the importance of session-control signaling has already
been emphasized in the field of QoS, it has been considered only to a limited
extent in terms of QoE. The performed research study has focused on the
interaction between session-control signaling, QoE, and user perception of SIP
signaling performance metrics. The research findings indicate that
session-control signaling load negatively affects the user perception of SIP
signaling performance metrics and overall QoE with the VoIP service. On the
other hand, it is shown that differentiated handling of session-control
signaling does not affect the QoS of the UC service, whereas its impact on QoE
in this context is left for future work. Therefore, one may conclude that
further investigation of this application-level mechanism for QoE management is
needed to avoid drawing misleading conclusions.


3.3 QoE Management with SDN 


Motivation. Subsection 2.2 provides an overview of several QoE-driven NAM
approaches; this subsection expands on that overview by presenting approaches
that specifically exploit the relatively new Software-Defined Networking (SDN)
paradigm [46]. Communication networks are already undergoing
an immense transformation in light of this paradigm. SDN commonly refers
to the separation of the network control and data planes, allowing a network
infrastructure to be configured from a central point, an SDN controller (SDNC),
by means of software. This configuration flexibility is facilitated by the open
Northbound and Southbound interfaces of the SDN architecture, which enable the
exchange of information among different functional entities of that
architecture in a well-defined manner. QoE management is not left unaffected by
the advancements of SDN technologies, since they bring new potential in terms
of (a) identifying novel use cases for QoE control beyond the pre-SDN era, and
(b) proposing new architectures and frameworks to achieve that. From the QoE
standpoint and the related basic functions of monitoring, reporting, and
management, the SDN architecture provides several benefits, which are
illustrated in Fig. 6.

Network-level QoE monitoring by an SDN infrastructure operator is simplified,
since the SDNC autonomously builds a “global view” of the network with
respect to its topology and performance indicators (such as throughput and
packet loss statistics). This enables the network operator to apply different
QoE-centric optimization strategies and enforce optimal, network-wide
decisions. Furthermore, the SDN architecture envisages open interfaces that
would ease QoE reporting by end-user clients and application servers on
monitored application-level QoE influence factors (IFs). These interfaces would
provide a basis for realizing cooperative QoE management between end-user
applications and the underlying network. For simplicity of presentation, Fig. 6
only outlines QoE reporting on application-level IFs. These QoE IFs are passed
on to an SDN application called “QoE mediator”, which runs on top of an SDNC and
is responsible for, e.g., generating aggregated QoE reports from multiple end
users/applications. Aggregated QoE reports are delivered to the SDNC over a
Northbound interface that it exposes. To realize cooperative management, QoE
mediators may also serve as, e.g., PMOs and instruct end-user clients or
application servers which management actions to enforce. The latter role would
require QoE mediators to employ an interface that supports conveying
application-level management instructions as well. With regard to network-level
QoE management, SDN facilitates per-flow network forwarding decisions and
differentiated traffic treatment (e.g., video streaming vs. audio conferencing)
on various levels of granularity, thus adding to overall QoE management
flexibility.

Fig. 6. QoE functions in the SDN scope.
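
The aggregation role of a QoE mediator can be sketched as follows. The report format (a dictionary with the keys `client`, `app`, and `ifs`), the IF names, and the function `aggregate_qoe_reports` are hypothetical; no standardized reporting interface is implied, and the delivery of the result to the SDNC over the Northbound interface is left out.

```python
from collections import defaultdict
from statistics import mean


def aggregate_qoe_reports(reports):
    """Aggregate per-client reports into one summary per application.

    Each summary holds the number of distinct reporting clients and the mean
    of every reported application-level influence factor (IF).
    """
    values = defaultdict(lambda: defaultdict(list))
    clients = defaultdict(set)
    for report in reports:
        clients[report["app"]].add(report["client"])
        for name, value in report["ifs"].items():
            values[report["app"]][name].append(value)
    return {
        app: {
            "clients": len(clients[app]),
            "mean_ifs": {name: mean(vals) for name, vals in ifs.items()},
        }
        for app, ifs in values.items()
    }


# Hypothetical reports from three end-user clients.
reports = [
    {"client": "c1", "app": "video", "ifs": {"buffer_s": 4.0, "stalls": 1}},
    {"client": "c2", "app": "video", "ifs": {"buffer_s": 8.0, "stalls": 0}},
    {"client": "c3", "app": "voip", "ifs": {"delay_ms": 120}},
]
aggregated = aggregate_qoe_reports(reports)
```

Aggregating per application rather than per client keeps the report delivered to the controller small, at the cost of hiding individual clients that are at risk; a mediator acting as a PMO would additionally need the per-client view.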


Approaches and Results. Recent research papers show that this potential has
already been acknowledged and taken advantage of. So far, most attention has
been paid to how SDN’s Southbound interface and one of its main realizations,
the OpenFlow (OF) specification [2], can be used to program network switches in
compliance with operator policies, which enforce automated flow manipulation.
This way, different traffic management schemes can be applied, such as to re- 
calculate and adjust traffic routes, prioritize traffic handling in the switches, or 
employ network admission control. A high-level view of an SDN-based system 
that would maximize user QoE by optimizing path selection process is described 
by Kassler et al. [44]. The authors present general requirements of such a system 
that would consider demands and parameters of various multimedia flows, in 
terms of media codecs, flow bitrates, end-to-end (E2E) delay, etc. To achieve 
it, an SDNC would be used to collect information on multimedia applications, 
build a global network view, and install optimal routing decisions. QoE Fairness 
Framework (QoE-FF) for adaptive video streaming is presented by Georgopoulos 
et al. in [29]. The goal of QoE-FF is to find an optimal point of video quality 
requests among heterogeneous end-user clients competing for the same network 
resources. To realize such a goal, QoE-FF relies on a PMO entity that collects 
client device characteristics influencing QoE (e.g., screen resolution) and video 
service features (such as supported content bitrates) via a Northbound interface, 
as well as network bandwidth status, and then imposes the respective bitrate
demand on each client. Jarschel et al. [41] describe an SDN-based approach that 
investigates route selection strategies in order to improve QoE for YouTube users. 
For the application-aware strategy, an SDNC exploits the estimated playout 
buffer status and traffic demand on network bandwidth to choose a less congested 
network path. 
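
An application-aware route selection of this kind can be sketched as below. This is an assumption-laden illustration rather than the strategy of [41]: the function `choose_path`, the buffer threshold, and the representation of a path as a (capacity, load) pair are hypothetical, and the installation of the chosen route in the switches is omitted.

```python
def choose_path(paths, demand_mbps, buffer_s, buffer_threshold_s=5.0,
                default_path="default"):
    """Select a path for a video flow based on its estimated playout buffer.

    paths maps a path name to (capacity_mbps, load_mbps). The flow stays on
    the default path while its buffer is healthy. If the buffer falls below
    the threshold, the least utilized path with enough free capacity for the
    flow's demand is chosen.
    """
    if buffer_s >= buffer_threshold_s:
        return default_path

    candidates = [
        (load / capacity, name)
        for name, (capacity, load) in paths.items()
        if capacity - load >= demand_mbps
    ]
    if not candidates:
        return default_path  # no path can carry the flow; keep the current one
    return min(candidates)[1]


# Hypothetical network state: the default path is almost saturated.
paths = {
    "default": (100.0, 95.0),
    "alt1": (100.0, 40.0),
    "alt2": (50.0, 10.0),
}
healthy = choose_path(paths, demand_mbps=4.0, buffer_s=20.0)
at_risk = choose_path(paths, demand_mbps=4.0, buffer_s=2.0)
```

Here the flow with a well-filled buffer is left on the default path, while the flow with only 2 s of buffered video is moved to the least utilized alternative.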

Nam et al. jointly optimize the selection of video delivery nodes in a Content 
Distribution Network (CDN) and network routes in the base SDN infrastruc- 
ture [56]. Their approach is built on monitoring application-level QoE IFs, such 
as initial reproduction delay and buffering rate, and calculating a new path in 
response to QoE degradations. An approach to dynamic network bandwidth
reservation that optimizes QoE among multiple competing video clients is
outlined by Ramakrishnan et al. in [63]. This approach employs a scheme for
allocating bandwidth to each client that considers QoE IFs such as the
specific content type (e.g., dynamic vs. static video scenes), the media codec
used, as well as the client’s playout buffer level. Implementation-wise, an
SDN-centric architecture is proposed that revolves around a QoE optimization
application on top of a PMO-like SDNC. This SDN application obtains information
on the client’s device type and buffer status, the requested video sequence(s),
and the base network topology, and then tailors the bandwidth reservation in
case of network congestion. Eckert and Knoll present the Internet Service
quality Assessment and Automatic Reaction (ISAAR) framework [26], which
encompasses QoE management functions for an SDN-enabled mobile network, namely
network-level QoE monitoring and control. ISAAR exploits the SDN capabilities
for (a) flow-based
QoE estimation, which is realized by OF flow detection and selective packet cap- 
turing, and (b) QoE control enforcement via OF traffic prioritization and other 
traffic engineering (TE) techniques. Ramakrishnan et al. in [64] describe an SDN- 
based architecture that allows the generation of QoE metrics (e.g., PSNR) and 
QoE analytics. To achieve that, a “Video Quality Application” (VQA) queries 
information from an SDNC regarding video content, user devices, and network 
performance. 
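
PSNR, mentioned above as an example metric, is defined as 10·log10(MAX² / MSE), where MAX is the maximum sample value and MSE is the mean squared error between reference and distorted samples. The sketch below computes it for flat sequences of 8-bit samples; the function name and the sample values are illustrative and are not part of the architecture in [64].

```python
import math


def psnr(reference, distorted, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally long sample lists."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


# Illustrative 8-bit luminance samples.
reference = [100, 100]
distorted = [110, 90]
value_db = psnr(reference, distorted)
```

For these two samples the MSE is 100, which gives a PSNR of about 28.13 dB.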

This emerging interest in tackling QoE management with SDN is also visible
in recent European research projects. The CASPER project
(http://casper-h2020.eu/) exploits SDN and NFV advancements towards improving
end-user QoE in wireless networks, focusing on voice, data and traditional video
applications. A novel framework is proposed, targeting its integration by mobile
operators. Moreover, the INPUT project (http://input-project.eu/) aims to extend
the SDN and NFV paradigms in order to pave the way for personal cloud services
and functionalities with the goal of optimizing QoE. Also, 5G NORMA
(https://5gnorma.5g-ppp.eu/) envisions a flexible architecture that enables the
multi-service- and context-aware adaptation of network functions to support a
variety of services and corresponding QoE/QoS requirements. Finally, the
CROSSFIRE project (http://mitn-crossfire.eu/) has investigated the sharing of
the same physical infrastructure by multiple network operators with the
objective of optimizing network operation and enforcing QoE management by means
of SDN/NFV.


Conclusion. To summarize, most of the SDN-based solutions and frameworks
for QoE management focus on a single end-user application, such as video
streaming, and on network-level management mechanisms, but without providing
specific details on the technical SDN realization and important architectural
aspects. Regarding the latter, some of the key missing parts relate to: (1) a
coordinated approach to distributed QoE monitoring, and (2) common interfaces
for reporting on QoE IFs, which concern applications and the underlying network,
but also end-users and general context information. Furthermore, the outlined
approaches are often use-case-specific and do not discuss general guidelines on
how to extend them so as to achieve more comprehensive QoE management
solutions.


4 Outlook: Future Evolution of Software-Based QoE
Management


As discussed in the previous sections, QoE provisioning within the current net- 
worked communication paradigm is a very challenging task, since the service 
delivery chain involves multiple stakeholders typically with competing inter- 
ests (OTT service providers, traditional Mobile Network Operators (MNO), and 
Internet Service Providers (ISPs)). As a result, true E2E QoE management of 
a service is currently impossible, since data traffic produced by an OTT is sub- 
ject to the network quality provided by an MNO, before it reaches the end- 
user. However, new technological advancements bring hope towards overcoming 
this isolation and truly enabling a holistic, E2E, cross-layer (i.e. network-level 
and application-level) QoE management. These identified technologies are SDN, 
NFV and MEC. Although MEC and NFV are driven by the same motives and 
follow similar design principles, according to [4], they are “complementary con- 
cepts that can exist independently”; therefore, they are examined separately 
below. 


4.1 Software-Defined Networking (SDN) 


First of all, SDN is a promising technology towards the direction of software- 
based QoE management (see Sect. 3.3). SDN, as of today, is mainly a tool used 
by operators of the network infrastructure to enforce traffic management policies 
within their domain, leaving the potential of a joint orchestration at the network 
and application levels unexploited. Nevertheless, SDN enables an abstraction of 
the network infrastructure, which, combined with the necessary SDN interfaces, 
can facilitate a closer collaboration between MNOs/ISPs and OTTs, respecting 
in parallel privacy concerns of each stakeholder. This visionary approach has 
been acknowledged by strategic white papers [3], as well as research papers, 
such as [6] and [52]. 

To achieve the full potential of QoE management with SDN, well-defined 
interfaces capable of realizing QoE reporting for different multimedia services are 
needed. Such interfaces would allow the SDN architectural elements, namely end- 
user clients, application servers, SDN applications, SDNCs and infrastructure 
devices, to convey information on all relevant QoE IFs. These interfaces can be 
scenario-specific and open, introducing great flexibility to 3rd parties who can 
program proprietary applications that use these interfaces. In this way, not only a 
comprehensive view on QoE could be formed, but also a more rapid design and 
implementation of QoE management frameworks would be enforced. Another 
important technological aspect that should be addressed is the identification of 
how generic QoE management blocks (see Sect. 2.2) are “mapped” to concrete 
SDN architectures and combined to create an efficient management cycle. The 
latter calls for the specification of different management strategies with regards 
to SDN monitoring and network-level traffic treatment as well. 


4.2 Network Functions Virtualization (NFV) 


The NFV paradigm enables the implementation of network functions in software 
that can then run on common hardware, but that can also be moved or instan- 
tiated at various network locations on demand. In order to achieve that, the 
available network, processing, and storage resources are to be configured based 
on policies from a central NFV orchestration system. One NFV topic closely 
related to QoE management that requires research attention deals with design- 
ing network-level QoE monitoring as a virtual function, which could be started 
“on-the-fly” on a commodity server. Other NFV aspects that need inspection 
relate to the orchestration process, which combines the operation of virtualized 
network functions. Here, one of the challenges is to efficiently merge different 
virtualized network-level functions so as to achieve a specific QoE management 
objective. 


4.3 Multi-access Edge Computing (MEC) 


MEC is another promising technology that fosters the closer collaboration of net- 
work operators and OTT parties, such as cloud, content and application providers, 
with the goal of efficiently maximizing QoE. MEC differs from NFV in terms of 
applications’ location (i.e. at the network edge), type (i.e. interfacing with the 
access network), and scope (i.e. mobility applications). Specifically, MEC repre- 
sents a technological paradigm, where network operators open up the Radio Access 
Network (RAN) edge of their networks to 3rd parties so that the latter can flexibly 
implement and offer novel services to their mobile customers, such as video analyt- 
ics and optimized local content distribution [1]. The ETSI body sees MEC as “the 
convergence of IT and telecommunications networking”. Similarly to SDN, MEC 
schemes will foster the joint, cross-layer QoE management for mobile subscribers, 
through authorizing the OTT players to exploit assets that exclusively belong to 
MNOs. An early example of this potential is found in [50], where a MEC server 
runs a novel adaptation algorithm enriched by the knowledge on wireless network 
congestion, which is provided by the MNO. This algorithm changes on the fly the 
HAS manifest files in response to the current network congestion, which drives end- 
user clients to select appropriate video segment representations that will diminish 
stallings and, thus, improve QoE. Similarly, Ge et al. in [28] guide the segment 
selection for video streaming users by locally caching the most popular content at 
the qualities that current network throughput can support. A reference architec- 
ture for the QoE-oriented management of services in the MEC ecosystem, exploit- 
ing Channel State Information (CSI), is discussed in [59]. 
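
The manifest adaptation idea described above can be illustrated with a minimal sketch. It is not the algorithm of [50]: the class and function names, the bitrates, and the safety margin are hypothetical, and the sketch simply assumes that the MNO exposes a throughput estimate to the MEC server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    """One video representation listed in a HAS manifest (hypothetical model)."""
    rep_id: str
    bitrate_kbps: int


def filter_manifest(representations, est_throughput_kbps, margin=0.8):
    """Keep only the representations that fit a congestion-aware bitrate budget.

    est_throughput_kbps stands for the estimate the MNO would provide;
    margin is an assumed safety factor, not a value taken from the literature.
    The lowest representation is always kept so that playback can continue.
    """
    budget = est_throughput_kbps * margin
    ordered = sorted(representations, key=lambda r: r.bitrate_kbps)
    allowed = [r for r in ordered if r.bitrate_kbps <= budget]
    return allowed or ordered[:1]


manifest = [
    Representation("240p", 400),
    Representation("480p", 1200),
    Representation("720p", 2500),
    Representation("1080p", 5000),
]

# Under congestion the client is only offered the representations that fit.
congested = filter_manifest(manifest, est_throughput_kbps=2000)
print([r.rep_id for r in congested])
```

Rewriting the manifest, rather than steering each client request, leaves the client-side adaptation logic untouched, which is the property the approach relies on.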


5 Conclusion 


In this chapter we have discussed the current state of the art in QoE management 
as well as the main challenges faced in this field. We presented a comprehensive 
framework for QoE-driven Network and Application Management (NAM) as well 
as specific approaches and discussed future prospects of selected technologies in 
this field. 

In general, our survey shows that there are many promising sophisticated 
approaches towards QoE-driven NAM. However, we also observe that research in 
this field tends to be rather patchy (i.e. addressing just parts of the overall sys- 
tem or optimizing just for a specific service) or tends to remain on a high level of 
abstraction (i.e. being far away from practical implementability). Thus the main 
overarching challenges in realizing the vision of QoE-driven NAM are less related to 
singular components, but rather to putting all components (and the related stake- 
holders) together in a coordinated and sustainable fashion. In fact, a number of 
works have shown that the most promising solutions require cooperation between 
stakeholders typically facing conflicting business interests (see Sect. 2.1). Thus, we 
hope to see more future work addressing these non-technical, yet critical challenges 
by pointing out collaboration opportunities and value creation in the context of 
QoE-driven NAM. Furthermore, monitoring and managing QoE comes at the cost 
of increased overhead and complexity (data gathering, coordination and control, 
etc.), which is further catalyzed by the emerging SDN and NFV technologies. Such 
costs tend to be neglected in existing research, yet they represent a major barrier 
to adoption in practice. Therefore, we encourage the community to integrate 
these aspects (and complexity in particular) more strongly in the design and evaluation 
of QoE management approaches. This especially refers to a holistic exploitation 
of SDN, NFV and MEC that would unleash the potential of a coordinated QoE 
management between the application and network levels. 

Finally, bringing QoE, AM and NM closely together also raises the need for 
aligned views and mindsets. Besides clarifying and synchronizing the meaning of 
different concepts and notions (like “quality”, “acceptability” or “performance”), 
their importance for the different academic and industrial communities needs to 
be assessed and aligned in order to make the vision of truly QoE-driven Network 
and Application Management a reality. 


Abstract. Conceptual and analytical models of an overall telecommunication 
system are utilized in this chapter for the definition of scalable indicators towards 
Quality of Service (QoS) monitoring, prediction, and management. The telecom- 
munication system is considered on different levels — service phase, service stage, 
network, and overall system. The network itself is presented in seven service stages 
— A-user, A-terminal, Dialing, Switching, B-terminal Seizure, B-terminal, and B- 
user, each having its own characteristics and specifics. Traffic quality indicators are 
proposed on each level. Two network cost/quality ratios are proposed — mean and 
instantaneous — along with illustrative numerical predictions of the latter, which 
could be useful for dynamic pricing policy execution, depending on the network 
load. All defined indicators could be considered as sources for Quality of Experi- 
ence (QoE) prediction. 


Keywords: Overall telecommunication system · Performance model · 
Dynamic quality of service (QoS) · Telecommunication subservices · 
Differentiated QoS subservice indicator · QoS prediction · Human factors of QoS · 
Instantaneous Cost/Quality Ratio · Quality of Experience (QoE) 


1 Introduction 


Starting from 2010, e.g. [1], a new attitude towards the Quality of Service (QoS) has 
become dominant, namely to consider QoS and Quality of Experience (QoE) as goods, 
and the usage of Experience Level Agreement (ELA) [2] has started to be discussed. 
The importance of the teletraffic models, particularly of the overall QoS indicators, for 
QoE assessment is emphasized by Fiedler [3]. Until now, however, the usage of 
performance models of overall telecommunication systems was not very popular. This 
chapter utilizes the models, elaborated in the Chapter “Conceptual and Analytical 
Models for Predicting the Quality of Service of Overall Telecommunication Systems” 
of this book, for the definition of scalable QoS indicators towards overall 
telecommunication system’s QoS monitoring, prediction, and management. Some 
indicators reflect predominantly the users’ experience. All defined indicators depend on 
human (users’) characteristics and technical characteristics, and may be considered as 
sources for QoE prediction. 

For this, in Sect. 2, traffic characterization of a service in a real device (service phase) 
is first elaborated. Definitions of served-, carried-, parasitic-, ousted-, and offered carried 
traffic are proposed, based on the ITU-T definitions, and eight service phase traffic 
quality indicators are proposed. 

In Sect. 3, the service stage concept is developed and corresponding traffic quality 
indicators are defined. 

In Sect. 4, telecommunication system and network efficiency indicators are proposed 
as follows: eight indicators on the service stage level, five indicators on the network 
level, and three indicators on the overall system level. The relationships between 
indicators on the service stage, network, and system levels are described. A comparison 
with classical network efficiency indicators is made. The applicability of the approach 
and results obtained for defining other indicators, as well as for numerical prediction of 
indicators’ values, is shown. 

In Sect. 5, two network cost/quality ratios are proposed — mean and instantaneous — 
and illustrative numerical predictions of the latter are presented, which may be useful 
for dynamic pricing policy execution, depending on the network load. 

In the Conclusion, possible directions for future research are briefly discussed. 


2 Service Phase Concept and Traffic Quality Indicators 


The conceptual model utilized in this chapter is described in detail in the Chapter 
“Conceptual and Analytical Models for Predicting the Quality of Service of Overall 
Telecommunication Systems” of this book. It consists of five levels: (1) overall tele- 
communication system and its environment; (2) overall telecommunication network; 
(3) service stages; (4) service phases; and (5) basic virtual devices. In the following 
subsections, the concepts of ‘service phase’ and ‘service stage’ are elaborated. 


2.1 Service Phase 


Based on the ITU-T definition of a service, provided in [4] (Term 2.14), i.e. “A set of 
functions offered to a user by an organization constitutes a service”, we propose the 
following definition of a service phase. 


Definition 1: The Service Phase is a service presentation containing: 


• One of the functions, realizing the service, which is considered indivisible; 
• All modeled reasons for ending/finishing this function, i.e. the causal structure of the 
  function; 
• Hypothetic characteristics, related to the causal structure of the function (a 
  well-known example of a hypothetic characteristic is the offered traffic concept). 

Following the Structural Normalization and Causal Structure approaches described 
in the Chapter “Conceptual and Analytical Models for Predicting the Quality of Service 
of Overall Telecommunication Systems” of this book, we may present the service phase 
in device s by means of k + 1 basic virtual causal devices, each representing a different 
reason for ending this service phase (Fig. 1). 


[Figure 1. Diagram labels: Ousted (ous.Ys); Offered Carried (ofr.crr.Ys).] 


Fig. 1. Traffic characterization of a service phase, represented as device s, by means of k + 1 
basic virtual causal devices. 


In Fig. 1, only one causal device represents successful completion of the service in 
device s, with carried traffic crr.Ys¹, whereas the remaining causal devices represent 
k different reasons for unsuccessful ending of the service, with traffics 
prs.Y1, prs.Y2, ..., prs.Yk, respectively. 

Generalizing, for more precise traffic characterization in a pool of resources, we 
propose the following definitions. 


Definition 2: The Served Traffic in a pool of resources is the traffic, occupying (using) 
resources in the pool. 
In Fig. 1, the served traffic in device s (srv.Ys) is the following sum: 


srv.Ys = prs.Y1 + prs.Y2 + ... + prs.Yk + crr.Ys. (1) 


Definition 3: The Carried Traffic in a pool of resources is the traffic, which was success- 
fully served in the pool (and carried to the next service phase). 
In Fig. 1, the carried traffic in device s is crr.Ys. 


Definition 4: The Parasitic Traffic in a pool of resources is the traffic, which was unsuc- 
cessfully served in the pool. 


¹ In the expressions, formulas and figures, the sign (.) is used only as a separator and NOT as a 
sign of multiplication. The multiplication operation is indicated by a gap between multiplied 
variables, e.g. X Y. 


84 S. Poryazov et al. 


In Fig. 1, each of the traffics prs.Y1, prs.Y2, ..., prs.Yk is parasitic. Parasitic 
traffic occupies real resources but not for an effective service execution. 

In Definitions 2 and 3, the served- and carried traffic are different terms, despite the 
ITU-T definition of the carried traffic as “The traffic served by a pool of resources” ([5], 
Term 5.5). We believe that this distinction leads to a better and more detailed traffic- 
and QoS characterization. 


Definition 5: The Ousted Traffic is the traffic that would be carried if there were no 
unsuccessful service ending in the pool of resources. 

In Fig. 1, each parasitic traffic prs.Y1, prs.Y2, ..., prs.Yk ousts a corresponding 
traffic that would be carried if there were no unsuccessful service ending of the 
corresponding type: ous.Y1, ous.Y2, ..., ous.Yk. The flow intensity to a parasitic device and 
the corresponding ousted device is the same by definition, i.e.: 


ous.Fi = prs.Fi, for i = [1, k], (2) 


but the service times are different. 
The hypothetic service time for every ousted device (ous.Ti) equals the carried 
service time (crr.Ts): 


ous.Ti = crr.Ts, for i = [1, k]. (3) 

The ousted traffic is a hypothetic one with the following intensity: 


ous.Yi = prs.Fi crr.Ts, for i = [1, k]. (4) 


2.2 Causal Generalization 


In the Chapter “Conceptual and Analytical Models for Predicting the Quality of Service 
of Overall Telecommunication Systems” of this book, the causal presentation and causal 
aggregation are discussed. The causal aggregation is understood as an aggregation of 
all cases in the model, corresponding to different reasons for service ending (referred to 
as unsuccessful cases further in this chapter). 

Here a causal generalization is proposed, as an aggregation of all unsuccessful cases 
(prs.Ys). Besides this, an aggregation of all cases of ousted traffic (ous.Ys) could be used: 


$$\mathit{prs.Ys} = \sum_{i=1}^{k} \mathit{prs.Yi}; \qquad (5)$$

$$\mathit{ous.Ys} = \sum_{i=1}^{k} \mathit{ous.Yi}. \qquad (6)$$


By Definition 2, the served traffic is a sum of the parasitic and carried traffic (c.f. 
Fig. 1): 


srv.Ys = prs.Ys + crr.Ys; (7) 
srv.Fs = prs.Fs + crr.Fs. (8) 


If the system is considered as being in a stationary state, by Little’s formula 
[6] we have: prs.Ys = prs.Fs prs.Ts and crr.Ys = crr.Fs crr.Ts. Hence: 


srv.Ys = srv.Fs srv.Ts = prs.Fs prs.Ts + crr.Fs crr.Ts. (9) 


Formulas (7), (8), and (9) illustrate the advantage of the traffic qualifiers — the nota- 
tion is invariant to the number of cases considered in a service phase. 


2.3 Offered Carried Traffic 


Definition 6: The Offered Carried Traffic (ofr.crr.Ys) in a pool s of resources is the sum 
of the carried traffic (crr.Ys) and ousted traffic (ous.Ys) in the pool: 


ofr.crr.Ys = ous.Ys + crr.Ys. (10) 


From (10), (6), (4), (8), $\mathit{prs.Fs} = \sum_{i=1}^{k} \mathit{prs.Fi}$, and crr.Ys = crr.Fs crr.Ts, the following 
formula could be obtained: 


ofr.crr.Ys = srv.Fs crr.Ts. 
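
The intermediate steps, which the text leaves implicit, can be written out as follows. This is only a worked restatement of Formulas (4), (6), (8), and (10); the \mathit wrappers are a typesetting choice for the dotted names.

```latex
\begin{align*}
\mathit{ofr.crr.Ys}
  &= \mathit{ous.Ys} + \mathit{crr.Ys}
     && \text{by (10)} \\
  &= \sum_{i=1}^{k} \mathit{ous.Yi} + \mathit{crr.Ys}
     && \text{by (6)} \\
  &= \sum_{i=1}^{k} \mathit{prs.Fi}\,\mathit{crr.Ts} + \mathit{crr.Fs}\,\mathit{crr.Ts}
     && \text{by (4) and Little's formula} \\
  &= \left(\mathit{prs.Fs} + \mathit{crr.Fs}\right)\mathit{crr.Ts}
     && \text{since } \mathit{prs.Fs} = \textstyle\sum_{i=1}^{k} \mathit{prs.Fi} \\
  &= \mathit{srv.Fs}\,\mathit{crr.Ts}
     && \text{by (8)}
\end{align*}
```

In words, the offered carried traffic is the traffic that all served attempts would generate if each of them held the resources for the carried service time.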


Definition 6 is analogous to the ITU-T definition of an Equivalent Offered Traffic [7] 
but considers the traffic related to the carried call attempts, whereas the ITU-T definition 
considers the traffic that would be served. 


2.4 Traffic Quality Indicators 


Indicator 1: Offered Carried Traffic Efficiency — the ratio of the carried traffic, in a 
service phase, to the offered carried traffic: 


$$I_1 = \frac{\mathit{crr.Ys}}{\mathit{ofr.crr.Ys}} = 1 - \frac{\mathit{ous.Ys}}{\mathit{ofr.crr.Ys}}. \qquad (11)$$

Indicator 2: Causal Ousted Importance — the ratio of the ousted traffic due to 
reason i (ous.Yi) to the offered carried traffic of a service phase (ofr.crr.Ys): 


$$I_2(i) = \frac{\mathit{ous.Yi}}{\mathit{ofr.crr.Ys}}. \qquad (12)$$

This indicator allows the estimation of missed benefits due to reason i and therefore 
the necessity of countermeasures against this reason. 
Indicator 3: Ousted Traffic Importance — the sum of all causal ousted importance 
indicators of a service phase. From Fig. 1, and Formulas (6) and (11), it is: 


$$I_3 = \sum_{i=1}^{k} I_2(i) = \sum_{i=1}^{k} \frac{\mathit{ous.Yi}}{\mathit{ofr.crr.Ys}} = \frac{\mathit{ous.Ys}}{\mathit{ofr.crr.Ys}} = 1 - \frac{\mathit{crr.Ys}}{\mathit{ofr.crr.Ys}}. \qquad (13)$$


Indicator 4: Service Efficiency — the ratio of the carried traffic to the served traffic: 


$$I_4 = \frac{\mathit{crr.Ys}}{\mathit{srv.Ys}} = 1 - \frac{\mathit{prs.Ys}}{\mathit{srv.Ys}}. \qquad (14)$$


Indicator 5: Causal Parasitic Importance — the ratio of the parasitic traffic due to 
reason i (prs.Yi) to the served traffic of a service phase (srv.Ys): 


$$I_5(i) = \frac{\mathit{prs.Yi}}{\mathit{srv.Ys}}. \qquad (15)$$


This indicator allows the estimation of an ineffective service due to reason i and 
therefore the necessity of countermeasures against this reason. 

Indicator 6: Parasitic Traffic Importance — the sum of all causal parasitic impor- 
tance indicators of a service phase. From Fig. 1, and Formulas (5) and (14), it is: 


$$I_6 = \sum_{i=1}^{k} I_5(i) = \sum_{i=1}^{k} \frac{\mathit{prs.Yi}}{\mathit{srv.Ys}} = \frac{\mathit{prs.Ys}}{\mathit{srv.Ys}} = 1 - \frac{\mathit{crr.Ys}}{\mathit{srv.Ys}}. \qquad (16)$$


Indicator 7: Ousted/Parasitic Traffic Ratio — this is the ratio of the ousted traffic 
to the parasitic traffic: 


$$I_7 = \frac{\mathit{ous.Ys}}{\mathit{prs.Ys}}. \qquad (17)$$

This indicator estimates the aggregated, by all reasons, ratio of missed benefits to 
the ineffective service in a service phase. 
Indicator 8: Causal Ousted/Parasitic Traffic Ratio — this is the ratio of the ousted 


traffic, due to reason i, to the parasitic traffic due to the same reason. From Definition 5 
and Formula (2): 


$$I_8(i) = \frac{\mathit{ous.Yi}}{\mathit{prs.Yi}} = \frac{\mathit{ous.Ti}}{\mathit{prs.Ti}}. \qquad (18)$$


This indicator gives another important estimation of a reason for ineffective service 
in a service phase. 
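
The eight service phase indicators can be evaluated together from the flow intensities and mean service times of the causal devices. The following is a minimal sketch rather than code from the chapter: the function name, argument names, and numerical values are illustrative assumptions, the keys I1 to I8 refer to Indicators 1 to 8, and at least one parasitic case with non-zero traffic is assumed (otherwise Indicators 7 and 8 are undefined).

```python
def phase_indicators(prs_F, prs_T, crr_F, crr_T):
    """Compute Indicators 1-8 for one service phase.

    prs_F, prs_T: flow intensity and mean service time of each of the k
                  parasitic (unsuccessful) cases, as equally long lists.
    crr_F, crr_T: flow intensity and mean service time of the carried case.
    """
    prs_Y = [f * t for f, t in zip(prs_F, prs_T)]  # Little's formula per case
    ous_Y = [f * crr_T for f in prs_F]             # Formula (4)
    crr_Y = crr_F * crr_T
    prs_Ys, ous_Ys = sum(prs_Y), sum(ous_Y)        # Formulas (5) and (6)
    srv_Ys = prs_Ys + crr_Y                        # Formula (7)
    ofr_crr_Ys = ous_Ys + crr_Y                    # Formula (10)
    return {
        "I1": crr_Y / ofr_crr_Ys,
        "I2": [y / ofr_crr_Ys for y in ous_Y],
        "I3": ous_Ys / ofr_crr_Ys,
        "I4": crr_Y / srv_Ys,
        "I5": [y / srv_Ys for y in prs_Y],
        "I6": prs_Ys / srv_Ys,
        "I7": ous_Ys / prs_Ys,
        "I8": [o / p for o, p in zip(ous_Y, prs_Y)],
    }


# Illustrative values only: two unsuccessful cases and one carried case.
ind = phase_indicators(prs_F=[0.5, 0.25], prs_T=[10.0, 20.0],
                       crr_F=2.0, crr_T=100.0)
for name, value in ind.items():
    print(name, value)
```

With these inputs Indicators 1 and 3 sum to one, as do Indicators 4 and 6, and each entry of Indicator 8 equals crr.Ts divided by the corresponding parasitic service time.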


3 Service Stage Concept and Traffic Quality Indicators 


Definition 7: The Service Stage is a service presentation containing: 


• One service phase, realizing one function of the service; 
• All auxiliary service phases that directly support this function realization but are not 
  part of the realized function itself. 


Examples of auxiliary service phases are the entry, exit, buffer, and queue virtual 
devices. The performance of the auxiliary devices depends directly on the service phase, 
realizing a function of the service. 

The service stage concept allows the division of the overall telecommunication 
service into subservices and therefore makes the system modeling process easier. 


3.1 Service Stage 


For simplicity in this subsection, the simplest possible service stage, consisting of only 
two service phases, is considered (Fig. 2). For more complex service stages with more 
phases, please refer to the Chapter “Conceptual and Analytical Models for Predicting 
the Quality of Service of Overall Telecommunication Systems” of this book. 


[Figure 2. Diagram labels: Entrance Device; Service Device; ofr.crr.Ye; ofr.crr.Ys; Served (srv.Ye); prs.Fg.] 


Fig. 2. A service stage g, consisting of Entrance and Service phases. 


The service stage g in Fig. 2 consists of Entrance and Service phases (represented 
by corresponding virtual devices). The Entrance device (e) may check the service request 
(call) attempt for having the relevant admission rights, whereas the Service device (s) 
checks for service availability or existence of free service resources, etc. Let ofr.Fg be 
the flow intensity of the service request attempts offered to this stage, crr.Fg the 
intensity of the outgoing carried flow, prs.Fg the flow intensity of the parasitic served 
requests, and prs.Fe the intensity of the parasitic call attempts flow. Then from Fig. 2 
we have: 


prs.Fe = ofr.Fg prs.Pe, (19) 


where prs.Pe is the probability of directing the service request attempts to the general- 
ized parasitic service in device e. By analogy: 


prs.Fs = ofr.Fg (1 − prs.Pe) prs.Ps. (20) 

The total parasitic flow in service stage g is: 

prs.Fg = prs.Fe + prs.Fs. (21) 


The carried traffic (crr.Yg) in service stage g is a sum of the carried traffic in devices 
e and s: 


crr.Yg = crr.Ye + crr.Ys = crr.Fe crr.Te + crr.Fs crr.Ts, (22) 

where: 

crr.Fe = ofr.Fg (1 − prs.Pe); (23) 
crr.Fs = ofr.Fg (1 − prs.Pe) (1 − prs.Ps). (24) 


The total carried traffic in service stage g is: 

crr.Yg = ofr.Fg (1 − prs.Pe) (crr.Te + (1 − prs.Ps) crr.Ts). (25) 


The estimation of the carried traffic in a service stage could be problematic due to 
the fact that some of the service request attempts carried in the first device (e) are not 
carried to the next device (s), i.e. they become parasitic service requests with probability 
prs.Ps (c.f. Fig. 2). 

Based on the ITU-T definition of ‘effective traffic’ [5], i.e. as “The traffic corre- 
sponding only to the conversational portion of effective call attempts”, we propose here 
the Effective Carried Traffic concept. 


Definition 8: The Effective Carried Traffic in a service stage is the traffic corresponding 
to the service request attempts leaving the stage with a fully successful (carried) service. 
In Fig. 2, the effective carried traffic (eff.crr.Yg) of service stage g is: 


eff.crr.Yg = eff.crr.Fg eff.crr.Tg, (26) 

where: 

eff.crr.Fg = crr.Fs = ofr.Fg (1 − prs.Pe) (1 − prs.Ps); (27) 
eff.crr.Tg = crr.Te + crr.Ts. (28) 


From (26), (27), and (28), we obtain: 


eff.crr.Yg = ofr.Fg (1 − prs.Pe) (1 − prs.Ps) (crr.Te + crr.Ts). (29) 

Note the difference between (25) and (29), i.e. in general, in a service stage, the 
effective carried traffic is less than the carried traffic. 
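
Subtracting (29) from (25) makes this gap explicit. The short derivation below is a worked step in the same notation, included for illustration:

```latex
\begin{align*}
\mathit{crr.Yg} - \mathit{eff.crr.Yg}
  &= \mathit{ofr.Fg}\,(1 - \mathit{prs.Pe})
     \bigl[\mathit{crr.Te} + (1 - \mathit{prs.Ps})\,\mathit{crr.Ts}
           - (1 - \mathit{prs.Ps})(\mathit{crr.Te} + \mathit{crr.Ts})\bigr] \\
  &= \mathit{ofr.Fg}\,(1 - \mathit{prs.Pe})\,\mathit{prs.Ps}\,\mathit{crr.Te} \;\ge\; 0 .
\end{align*}
```

The difference is the traffic carried in the Entrance device by attempts that later end unsuccessfully in the Service device. It vanishes only if one of the factors ofr.Fg, (1 − prs.Pe), prs.Ps, or crr.Te is zero.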

The Offered Traffic is a fundamental teletraffic engineering concept. We use the 
ITU-T definition of the Equivalent Offered Traffic [7], i.e. “Offered traffic, to a pool of 
resources, is the sum of carried and blocked traffic of this pool”. 

The blocked traffic corresponds to the blocked attempts, as per Definition 2.8 in [5]: 
“Blocked call attempt: A call attempt that is rejected owing to a lack of resources in the 
network”. This definition, however, is too narrow to be applied directly to blocked 
service request attempts as it does not include most of the reasons for rejection, including 
access control, service unavailability, called terminal unavailability or busyness, and 
many others. Thus we propose the following extension of it. 


Definition 9: The Blocked Service Request Attempt is a service request attempt with 
rejected service, in the intended pool of resources, due to any reason. 

Blocked traffic is a service stage concept because it considers blocking of service 
requests before entering the service phase, or in other words, blocking that occurs in 
another virtual device before the corresponding service device. 

In Fig. 2, blocking occurs in the Entrance device. The blocked traffic (blc.Ys) corresponds 
to the service request attempts offered to service stage g (ofr.Fg), but not 
belonging to the served attempts (srv.Fs). From Fig. 2 and Little’s theorem, we 
obtain: 


blc.Ys = blc.Fs blc.Ts. (30) 


The service request attempts that are not carried in phase e, and hence are rejected 
to the next service phase, are considered parasitic in phase e. For the intensity of the 
blocked attempts, the following equality holds: 


blc.Fs = prs.Fe = ous.Fe = ofr.Fg prs.Pe. (31) 


The offered traffic is a hypothetic one “that would be served” if it is not blocked, and 
therefore: 


blc.Ts = srv.Ts. (32) 

From (30), (31), and (32), we obtain the following formula: 

blc.Ys = ofr.Fg prs.Pe srv.Ts, (33) 


which is valid for the generalized reason for service request attempts rejection in service 
phase e (i.e. in the Entrance device). 

From the definition of the equivalent offered traffic, Fig. 2, and Formulas (23), (33), 
and srv.Fs = crr.Fe, the traffic offered to the service device s is: 


ofr.Ys = blc.Ys + srv.Ys = ofr.Fg srv.Ts. (34) 


3.2 Traffic Quality Indicators 


Many of the service-stage traffic quality indicators may be reformulated as service-stage 
performance indicators as done below. 

Indicator 9: Carried Effectiveness of a Service Stage — the ratio of the effective 
carried traffic to the carried traffic: 


$$I_9 = \frac{\mathit{eff.crr.Y}}{\mathit{crr.Y}}. \qquad (35)$$


For instance, from (25) and (29), the Carried Effectiveness of service stage g in 
Fig. 2 is: 


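$$I_9 = \frac{\mathit{eff.crr.Yg}}{\mathit{crr.Yg}} = \frac{(1 - \mathit{prs.Ps})(\mathit{crr.Te} + \mathit{crr.Ts})}{\mathit{crr.Te} + (1 - \mathit{prs.Ps})\,\mathit{crr.Ts}}. \qquad (36)$$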
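
A small numerical sketch may help to compare the carried and the effective carried traffic of the two-phase stage in Fig. 2. The function name and the input values below are illustrative assumptions; the expressions follow Formulas (22) to (29) and (36).

```python
def stage_traffic(ofr_Fg, prs_Pe, prs_Ps, crr_Te, crr_Ts):
    """Return (crr.Yg, eff.crr.Yg, Carried Effectiveness) for the stage in Fig. 2."""
    crr_Fe = ofr_Fg * (1 - prs_Pe)               # Formula (23)
    crr_Fs = crr_Fe * (1 - prs_Ps)               # Formula (24)
    crr_Yg = crr_Fe * crr_Te + crr_Fs * crr_Ts   # Formulas (22) and (25)
    eff_crr_Yg = crr_Fs * (crr_Te + crr_Ts)      # Formulas (26) to (29)
    return crr_Yg, eff_crr_Yg, eff_crr_Yg / crr_Yg


# Illustrative values: 10 attempts per time unit, 25 % rejected at the
# entrance, 50 % of the admitted attempts failing in the service device.
crr_Yg, eff_crr_Yg, effectiveness = stage_traffic(
    ofr_Fg=10.0, prs_Pe=0.25, prs_Ps=0.5, crr_Te=1.0, crr_Ts=9.0)
print(crr_Yg, eff_crr_Yg, effectiveness)
```

For these inputs the carried traffic is 41.25 and the effective carried traffic is 37.5, so the Carried Effectiveness is 37.5/41.25, matching the closed form 5/5.5 given by Formula (36).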


4 Telecommunication System and Network Efficiency Indicators 


4.1 Telecommunication System QoS Concept 


Users are shown in “Fig. 1 — Schematic contributions to end-to-end QoS” in [4] but they 
are not connected to the network. In Fig. 3, schematic contributions to QoS in an overall 
telecommunication system, including users, are presented in more detail. 


Fig. 3. Schematic contributions to QoS in an overall telecommunication system, including users. 


In Fig. 3, the calling (A) and called (B) users and terminals, as well as the main 
service stages of the service request (call) attempts in a telecommunication network, are 
presented. The telecommunication network is usually presented as having five service 
stages: A-terminal, Dialing, Switching, B-terminal Seizure, and B-terminal, where each 
service stage has its own characteristics. However, there are two other stages, A-user 
and B-user, with their own specifics. 
In Fig. 3, the possible paths of service request (call²) attempts are the following: 


1. int.Fa: The calling users (A-users) generate intent call attempts, with intensity 
int.Fa, represented as a Generate device in the A-User block. Call intent is “The 
desire to establish a connection to a user”. “This would normally be manifested by 
a call demand. However, demands may be suppressed or delayed by the calling 
user’s expectation of poor Quality of Service performance at a particular 
time” [5]. 

2. sup.Fa: The intensity of suppressed intent call attempts. Suppressed traffic is “The 
traffic that is withheld by users who anticipate a poor quality of service (QoS) 
performance” [5]. “At present, suitable algorithms for estimating suppressed traffic 
have not been defined” [7]. 

3. dem.Fa: The intensity of demand call attempts. Call demand is: “A call intent that 
results in a first call attempt” [5]. 

4. rep.Fa: The intensity of repeated call attempts. Repeated call attempt is: “Any of 
the call attempts subsequent to a first call attempt related to a given call demand. 
NOTE - Repeated call attempts may be manual, i.e. generated by humans, or auto- 
matic, i.e. generated by machines” [7]. 

5. ofr.Fa: The intensity of all call attempts (demand and repeated) trying to occupy 
A-terminals. A-terminals are considered as the first service stage (c.f. Sect. 3) in 
the telecommunication network. From Fig. 3: 


ofr.Fa = dem.Fa + rep.Fa. (37) 


6. prs.Fa: The intensity of all parasitic (unsuccessfully served, c.f. Sect. 2) call 
attempts in A-terminals. 

We are modeling the system in a stationary state and for each considered service 
stage the intensity of the offered call attempts equals the sum of the outgoing para- 
sitic and carried flows, e.g. ofr.Fa = prs.Fa + crr.Fa. 

For each service stage, part of the parasitic attempts are terminated by the A-user 
(c.f. devices of type ‘terminator’ in Fig. 3) and the rest join the repeated attempt’s 
flow (rep.Fa). 

7. ofr.Fd = crr.Fa: The intensity of carried (in A-terminals) call attempts (crr.Fa) is 
equal to the intensity of the offered call attempts (ofr.Fd) to the Dialing stage in the 
network. 

8. prs.Fd: The intensity of all parasitic (unsuccessfully served, c.f. Sect. 2) call 
attempts in the Dialing stage. 

9. ofr.Fs = prs.Fs + crr.Fs: The intensity of the offered-, parasitic-, and carried 
flows of call attempts of the Switching stage. 


² Throughout the rest of this chapter, the term ‘call’ should be interpreted in a broader meaning 
of a ‘service request’. 

10. ofr.Fz = prs.Fz + crr.Fz: The intensity of the offered-, parasitic-, and carried 
flows of call attempts of the ‘B-terminal seizure’ stage. The intended B-terminal 
may be busy or unavailable and this will cause blocking of call attempts. 

11. ofr.Fb = prs.Fb + crr.Fb: The intensity of the offered-, parasitic-, and carried 
flows of call attempts of the B-terminal stage. 

12. ofr.Fbu = prs.Fbu + crr.Fbu: The intensity of the offered-, parasitic-, and 
carried flows of call attempts of the B-user stage. The B-user may be absent, busy, 
tired, etc. 


4.2 Efficiency Indicators 


The efficiency indicators, proposed in this chapter, are considered on five levels: (1) 
service phase; (2) service stage; (3) part of network; (4) overall network; and (5) overall 
telecommunication system. 


4.2.1 Proposed Efficiency Indicators on Service Stage Level 
In each service stage, a basic performance indicator is the ratio between intensities of 
the carried flow and offered flow of call attempts. An exception is the A-User stage 
because there are two sub-stages in it — Ai (considering the intent call attempts) and Ad 
(considering the demand call attempts). 

Indicator 10: Efficiency indicator Qai on the Ai sub-stage: 

$$I_{10} = \mathit{Qai} = \frac{\mathit{dem.Fa}}{\mathit{int.Fa}} \tag{38}$$


Indicator 11: Efficiency indicator Qad on the Ad sub-stage. 
Let Pr be the aggregated probability of repetition of the call attempts offered to the 
A-terminals: 

$$\mathit{Pr} = \frac{\mathit{rep.Fa}}{\mathit{ofr.Fa}} \tag{39}$$

From (37) and (39), the following formula can be obtained for the efficiency 
indicator Qad: 

$$I_{11} = (1 - \mathit{Pr}) = \frac{\mathit{dem.Fa}}{\mathit{ofr.Fa}} = \frac{\mathit{dem.Fa}}{\mathit{dem.Fa} + \mathit{rep.Fa}} = \frac{1}{\beta} = \mathit{Qad} \tag{40}$$

where β is defined in [7] as: 

$$\beta = \frac{\text{All call attempts}}{\text{First call attempts}} \tag{41}$$

In (40), Qad is de facto the probability corresponding to the ratio of the intensity of 
the primary (demand) call attempts to the intensity of the offered attempts. It may be 
considered an aggregated overall network performance indicator (as per the initial 
attempt in [8]). 
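The relations (39) to (41) can be checked numerically. The Python sketch below is only an illustration: the function name and the sample intensities are assumptions, and it relies on ofr.Fa = dem.Fa + rep.Fa, as used in the denominator of (40).

```python
def ad_substage_indicators(dem_fa: float, rep_fa: float) -> dict:
    """Return Pr, Qad and beta for the Ad sub-stage.

    dem_fa: intensity of the demand (first) call attempts, dem.Fa
    rep_fa: intensity of the repeated call attempts, rep.Fa
    Assumes ofr.Fa = dem.Fa + rep.Fa, as in the denominator of (40).
    """
    if dem_fa <= 0 or rep_fa < 0:
        raise ValueError("dem_fa must be positive and rep_fa non-negative")
    ofr_fa = dem_fa + rep_fa
    pr = rep_fa / ofr_fa      # (39): probability of repetition
    qad = dem_fa / ofr_fa     # (40): equals 1 - Pr
    beta = ofr_fa / dem_fa    # (41): all attempts / first attempts
    return {"Pr": pr, "Qad": qad, "beta": beta}


# Hypothetical intensities (attempts per second), chosen only for illustration.
print(ad_substage_indicators(dem_fa=80.0, rep_fa=20.0))
```

With these sample values one call in five is a repetition, so Qad = 1 - Pr and Qad = 1/β describe the same quantity from two directions.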


Indicator 12: Efficiency indicator Qa on the A-terminal stage: 

$$I_{12} = \mathit{Qa} = \frac{\mathit{crr.Fa}}{\mathit{ofr.Fa}} \tag{42}$$

Indicator 13: Efficiency indicator Qd on the Dialing stage: 

$$I_{13} = \mathit{Qd} = \frac{\mathit{crr.Fd}}{\mathit{ofr.Fd}} \tag{43}$$

Indicator 14: Efficiency indicator Qs on the Switching stage: 

$$I_{14} = \mathit{Qs} = \frac{\mathit{crr.Fs}}{\mathit{ofr.Fs}} \tag{44}$$

Indicator 15: Efficiency indicator Qz on the ‘B-terminal seizure’ stage: 

$$I_{15} = \mathit{Qz} = \frac{\mathit{crr.Fz}}{\mathit{ofr.Fz}} \tag{45}$$

Indicator 16: Efficiency indicator Qb on the B-terminal stage: 

$$I_{16} = \mathit{Qb} = \frac{\mathit{crr.Fb}}{\mathit{ofr.Fb}} \tag{46}$$

Indicator 17: Efficiency indicator Qbu on the B-user stage: 

$$I_{17} = \mathit{Qbu} = \frac{\mathit{crr.Fbu}}{\mathit{ofr.Fbu}} \tag{47}$$


4.2.2 Proposed Efficiency Indicators on Network Level 
Network efficiency indicators estimate QoS characteristics of portions of the network 
comprising more than one service stage, or of the overall network. In this subsection, as 
usual, the indicated network portion begins at the starting points of the network and 
ends at another network point of interest. All network efficiency indicators are fractions 
whose denominator is the intensity ofr.Fa of the flow offered to the A-terminals. 

The classic network efficiency indicators are the following three, e.g. as defined in 
[9]: 


1. “Answer Seizure Ratio (ASR) = (number of seizures that result in an answer 
signal)/(the total number of seizures)” ... “Measurement of ASR may be made on a 
route or on a destination code basis” ... “A destination can be a mobile network, a 
country, a city, a service, etc.” [9]. 

2. “Answer Bid Ratio (ABR) = (number of bids that result in an answer signal)/(total 
number of bids); ABR is similar to ASR except that it includes bids that do not result 
in a seizure” [9]. 

3. “Network Effectiveness Ratio (NER): NER is designed to express the ability of 
networks to deliver calls to the far-end terminal. NER expresses the relationship 
between the number of seizures and the sum of the number of seizures resulting in 
either an answer message, or a user busy, or a ring no answer, or in the case of ISDN 
a terminal rejection/unavailability. Unlike ASR, NER excludes the effects of 
customer behavior and terminal behavior” [9]. 
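The three quoted ratios can be computed from per-route counters. The Python sketch below is a minimal illustration; the counter names and sample values are hypothetical, and NER is read here as the listed seizure outcomes divided by the total number of seizures, which is an interpretation of the quoted wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCounts:
    """Hypothetical counters collected on one route or destination code."""
    bids: int        # all bids, including those that do not result in a seizure
    seizures: int    # bids that resulted in a seizure
    answered: int    # seizures that resulted in an answer signal
    user_busy: int   # seizures that ended with 'user busy'
    no_answer: int   # seizures that ended with 'ring no answer'
    rejected: int    # ISDN terminal rejection/unavailability


def asr(c: CallCounts) -> float:
    """Answer Seizure Ratio: answered seizures over all seizures."""
    return c.answered / c.seizures


def abr(c: CallCounts) -> float:
    """Answer Bid Ratio: answered bids over all bids."""
    return c.answered / c.bids


def ner(c: CallCounts) -> float:
    """Network Effectiveness Ratio (assumed form): calls delivered to the
    far-end terminal, whatever the user or terminal then did, over seizures."""
    delivered = c.answered + c.user_busy + c.no_answer + c.rejected
    return delivered / c.seizures


# Illustrative counts only, not measurements.
counts = CallCounts(bids=1000, seizures=900, answered=540,
                    user_busy=180, no_answer=90, rejected=18)
print(asr(counts), abr(counts), ner(counts))
```

Because NER counts busy, unanswered and rejected calls as delivered, it is never smaller than ASR for the same counters.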


These classic efficiency indicators reflect the network providers’ viewpoint, but they 
consider neither initiated but unsuccessful communication nor the influence of 
repeated attempts. 

Below we propose new network efficiency indicators, all having as an index the first 
letter of the last service stage considered. 

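Indicator 18: Network efficiency indicator Ea on the A-terminal stage, cf. also 
(42): 

$$I_{18} = \mathit{Ea} = \frac{\mathit{crr.Fa}}{\mathit{ofr.Fa}} = \mathit{Qa} = I_{12} \tag{48}$$

Indicator 19: Network efficiency indicator Ed on the Dialing stage, cf. also (43): 

$$I_{19} = \mathit{Ed} = \frac{\mathit{crr.Fd}}{\mathit{ofr.Fa}} = \mathit{Qa}\,\mathit{Qd} = I_{13}\,I_{18} \tag{49}$$

Indicator 20: By taking into account that crr.Fd = ofr.Fs, cf. Fig. 3 and (44), the 
network efficiency indicator Es on the Switching stage is: 

$$I_{20} = \mathit{Es} = \frac{\mathit{crr.Fs}}{\mathit{ofr.Fa}} = \mathit{Qa}\,\mathit{Qd}\,\mathit{Qs} = I_{14}\,I_{19} \tag{50}$$

Indicator 21: By taking into account that crr.Fs = ofr.Fz, cf. Fig. 3 and (45), the 
network efficiency indicator Ez on the ‘B-terminal seizure’ stage is: 

$$I_{21} = \mathit{Ez} = \frac{\mathit{crr.Fz}}{\mathit{ofr.Fa}} = \mathit{Qa}\,\mathit{Qd}\,\mathit{Qs}\,\mathit{Qz} = I_{15}\,I_{20} \tag{51}$$

Indicator 22: By taking into account that crr.Fz = ofr.Fb, cf. Fig. 3 and (46), the 
network efficiency indicator Eb on the B-terminal stage is: 

$$I_{22} = \mathit{Eb} = \frac{\mathit{crr.Fb}}{\mathit{ofr.Fa}} = \mathit{Qa}\,\mathit{Qd}\,\mathit{Qs}\,\mathit{Qz}\,\mathit{Qb} = I_{16}\,I_{21} \tag{52}$$
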
This indicator corresponds to the cases of B-user answers, but does not consider the 
successfulness of the communication. 


4.2.3 Proposed Efficiency Indicators on Overall System Level 
Indicator 23: By taking into account that crr.Fb = ofr.Fbu, cf. Fig. 3 and (47), the 
system efficiency indicator Ebu on the B-user stage is: 

$$I_{23} = \mathit{Ebu} = \frac{\mathit{crr.Fbu}}{\mathit{ofr.Fa}} = \mathit{Qa}\,\mathit{Qd}\,\mathit{Qs}\,\mathit{Qz}\,\mathit{Qb}\,\mathit{Qbu} = I_{17}\,I_{22} \tag{53}$$

This indicator corresponds to the cases of fully successful communication, from the 
users’ point of view, regarding all call attempts offered to the network. 

Indicator 24: System efficiency indicator Eu on the Ad sub-stage, cf. also (40): 

$$I_{24} = \mathit{Eu} = \mathit{Qad}\,\mathit{Ebu} = \mathit{Qad}\,\mathit{Qa}\,\mathit{Qd}\,\mathit{Qs}\,\mathit{Qz}\,\mathit{Qb}\,\mathit{Qbu} = I_{11}\,I_{23} \tag{54}$$


This indicator corresponds to the cases of fully successful communication, from the 
A-users’ point of view, regarding demand call attempts. It shows what part of the first 
(demand) attempts is fully successful. It may be called ‘Demand Efficiency’. It is a user- 
oriented indicator, compounding explicitly repeated attempts, connection and commu- 
nication parameters. 

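Indicator 25: System efficiency indicator Ei on the Ai sub-stage, cf. also (38): 

$$I_{25} = \mathit{Ei} = \mathit{Qai}\,\mathit{Eu} = \mathit{Qai}\,\mathit{Qad}\,\mathit{Qa}\,\mathit{Qd}\,\mathit{Qs}\,\mathit{Qz}\,\mathit{Qb}\,\mathit{Qbu} = I_{10}\,I_{24} \tag{55}$$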


This indicator corresponds to the cases of fully successful communication, from the 
A-users’ point of view, regarding intent call attempts. It shows what part of the intent 
attempts is fully successful. It is very difficult to measure Ei directly because suppressed 
attempts (forming the demands w.r.t. point 2 in Subsect. 4.1) can’t reach the network 
and therefore can’t be measured there. 
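The chain of indicators (42) to (55) can be illustrated with a short Python sketch. It assumes, as in the text, that the flow carried by one stage is the flow offered to the next; the function name and all numerical values are hypothetical.

```python
from math import isclose, prod

# Service stages in order: A-terminal, Dialing, Switching,
# 'B-terminal seizure', B-terminal, B-user.
STAGES = ("a", "d", "s", "z", "b", "bu")


def efficiency_chain(ofr_fa, carried, qad, qai):
    """Compute stage indicators Q, cumulative indicators E, and Eu, Ei.

    ofr_fa:  intensity offered to the A-terminals (ofr.Fa)
    carried: dict mapping each stage to its carried intensity (crr.F*)
    qad:     Qad from (40); qai: Qai from (38)
    """
    q, e = {}, {}
    offered = ofr_fa
    for stage in STAGES:
        q[stage] = carried[stage] / offered   # (42)-(47): crr.F / ofr.F
        e[stage] = carried[stage] / ofr_fa    # (48)-(53): crr.F / ofr.Fa
        offered = carried[stage]              # carried flow feeds the next stage
    eu = qad * e["bu"]                        # (54)
    ei = qai * eu                             # (55)
    return q, e, eu, ei


# Illustrative intensities only.
carried = {"a": 95.0, "d": 90.0, "s": 81.0, "z": 72.0, "b": 60.0, "bu": 54.0}
q, e, eu, ei = efficiency_chain(100.0, carried, qad=0.8, qai=0.9)

# The product of the stage indicators reproduces Ebu, as stated in (53).
assert isclose(prod(q.values()), e["bu"])
print(e["bu"], eu, ei)
```

The sketch makes the telescoping structure visible: every network-level indicator is the previous one multiplied by a single stage-level ratio.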


4.3 Approach Applicability and Results 


Most of the proposed indicators are flow-oriented, as they take into account the flow 
intensities. Flow-oriented indicators are at the core of time- and traffic-oriented 
indicators. In this subsection, numerical results are presented for some of the proposed 
flow indicators and for other time- and traffic-oriented indicators built on their basis. 
An analytical model of the overall telecommunication system, corresponding to Fig. 3, 
is used. Methods for building such models are described in the Chapter “Conceptual and 
Analytical Models for Predicting the Quality of Service of Overall Telecommunication 
Systems” of this book. 

The numerical results are presented for the entire theoretical network-traffic-load 
interval, i.e. the terminal traffic of all A- and B-terminals (Yab) ranges from 0% to 
100% of the number Nab of all active terminals in the network. The input parameters 
are the same in all cases, except for the capacity of the network (the number of 
equivalent connection lines), given as a percentage of all terminals in the system. 
Differences in the network capacity cause different blocking probabilities due to 
resource insufficiency. Three cases have been considered: 


• Case 1: Without repeated service request attempts and without blocking; 
• Case 2: With repeated service request attempts but without blocking; 
• Case 3: With repeated service request attempts and with blocking. 

Figures 4, 5, 6 present numerical results obtained for some of the proposed efficiency 
indicators, whereas Figs. 7, 8, 9 present numerical results obtained for some time- and 
traffic-oriented indicators. 

Fig. 4. Efficiency indicators Qad, Ebu and Eu for Case 1 (Qad = 1 and Ebu = Eu because there 
are no repeated service request attempts in the system). 

[Plot: Case 2, with repeated service request attempts but without blocking; horizontal 
axis: Network Traffic Load (Yab/Nab) [%].] 

Fig. 5. Efficiency indicators Qad, Ebu and Eu for Case 2 (the network performance is degraded 
considerably due to repeated service request attempts). 

Fig. 6. Efficiency indicators Qad, Ebu and Eu for Case 3 (the network performance degrades 
sharply due to blocking). 

Fig. 7. Time and traffic AB-efficiency for Case 1 ((Ebu Tcc)/Tab = (Eu Tcc)/Tab, because there 
are no repeated service request attempts). 

Fig. 8. Time and traffic AB-efficiency for Case 2 ((Eu Tcc)/Tab is sensitive to repeated service 
request attempts, in contrast to (Ebu Tcc)/Tab, which is not). 

Fig. 9. Time and traffic AB-efficiency for Case 3 (the indicator paid Traffic/Yab is not sensitive 
enough in the network-load interval without blocking). 


5 Network Cost/Quality Ratios 


We consider the overall telecommunication system model, presented in Fig. 3, with the 
following assumptions: 

Assumption 1: The observation time interval Δt is limited; 

Assumption 2: The full system costs (SC) in the time interval Δt are known; 

Assumption 3: The cost/quality ratio depends on the paid volume of traffic (paid.V) in 
this time interval and the QoS indicator (Q); 

Assumption 4: The full system costs (SC) don’t depend considerably on the served 
traffic volume in the time interval Δt; 

Assumption 5: The QoS indicator (Q) is dimensionless with values from the interval 
(0,1] and is proportional to the quality (Q = 1 means ‘the best quality’). 


5.1 Mean Cost/Quality Ratio 


Based on these assumptions and the definition of the traffic volume, i.e. “The traffic 
volume in a given time interval is the time integral of the traffic intensity over this time 
interval” [5], the ‘Cost per Unit’ quantity is: 


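$$\text{Cost per Unit} = \frac{\text{Full System's Costs [Euro]}}{\text{Paid Traffic Volume}\ [\text{Erlang} \times \Delta t]} \tag{56}$$

By dividing this by the QoS indicator (Q), we obtain the following: 

$$\frac{\text{Cost per Unit}}{\text{Quality}} = \frac{\text{Full System's Costs [Euro]}}{Q \; \mathit{paid.V}\ [\text{Erlang} \times \Delta t]} = \frac{SC}{Q \; \mathit{paid.V}} \tag{57}$$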


The definition of the paid traffic may depend on the telecommunication service 
provider. The estimation of the paid traffic volume is a routine operation (c.f. ITU-T 
Recommendations Series D: General Tariff Principles). 

The definition of the QoS indicator (Q) may differ between the users’ perspective (i.e. 
as a generalized QoE parameter) and the telecommunication service provider’s 
perspective. In general, it is best to include the definition of Q in the Service Level 
Agreement (SLA). In any case, the value of the QoS indicator (Q) in (57) is the mean 
value in the time interval considered. 

The mean cost/quality ratio (57) is suitable for relatively long intervals: days, 
months, years. 
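As a worked illustration of (57), the Python sketch below evaluates the mean cost/quality ratio for one observation interval. The function name, the units (Euro and Erlang-hours) and the sample values are assumptions made for the example only.

```python
def mean_cost_quality_ratio(sc_euro: float, paid_volume: float, q: float) -> float:
    """Mean cost/quality ratio per (57): SC / (Q * paid.V).

    sc_euro:     full system costs SC in the interval [Euro]
    paid_volume: paid traffic volume paid.V in the interval [Erlang x time]
    q:           mean QoS indicator Q in the interval, in (0, 1] (Assumption 5)
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("Q must lie in the interval (0, 1]")
    if paid_volume <= 0.0:
        raise ValueError("paid traffic volume must be positive")
    return sc_euro / (q * paid_volume)


# Hypothetical month: 1000 Euro of costs, 500 Erlang-hours paid, mean Q = 0.8.
ratio = mean_cost_quality_ratio(sc_euro=1000.0, paid_volume=500.0, q=0.8)
print(ratio)  # Euro per Erlang-hour, per unit of quality
```

For fixed costs and paid volume, a lower Q raises the ratio, i.e. the same traffic is delivered at a worse cost/quality trade-off.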


5.2 Instantaneous Cost/Quality Ratio 


We consider the traffic intensity (Y) as per the ITU-T definition, i.e. “The instantaneous 
traffic in a pool of resources is the number of busy resources at a given instant of time” 
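[5]. From the assumptions made and (57), taking the paid traffic volume as 
paid.V = paid.Y Δt, the following formula can be obtained: 

$$\frac{\text{Cost per Unit}}{\text{Quality}} = \frac{\text{Full System's Costs [Euro]}}{\Delta t\ [\text{Time}] \; Q \; \mathit{paid.Y}\ [\text{Erlang}]} = \frac{SC}{\Delta t \; Q \; \mathit{paid.Y}} \tag{58}$$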


The paid traffic intensity (paid.Y) is an instantaneous quantity, but the ratio ‘Cost per 
Unit/Quality’ (58) depends on the duration of the time interval. We define the ‘System’s 
Costs Intensity’ (SCI) parameter, independent of the time interval duration (but 
dependent on the position of the interval in the service provider’s lifetime), as per the 
following formula: 

$$\text{System's Costs Intensity}\ (SCI) = \frac{\text{Full System's Costs [Euro]}}{\Delta t\ [\text{Time}]} = \frac{SC}{\Delta t} \tag{59}$$


The System’s Costs Intensity (SCI) parameter allows defining a new useful parameter, 
the Normalized Cost/Quality Ratio (NCQR): 

$$\text{Normalized Cost/Quality Ratio}\ (NCQR) = \frac{1}{Q \; \mathit{paid.Y}} \tag{60}$$

The Normalized Cost/Quality Ratio (NCQR) is independent of the absolute amount of 
the system’s costs. It is normalized, because it is the cost/quality ratio per 1 Euro cost. 
From (57), (58), and (59), we obtain: 

$$\frac{\text{Cost per Unit}}{\text{Quality}} = \frac{\text{Full System's Costs [Euro]}}{\Delta t\ [\text{Time}]} \cdot \frac{1}{Q \; \mathit{paid.Y}\ [\text{Erlang}]} = SCI \cdot NCQR \tag{61}$$


The proposed quantities SCI and NCQR allow the estimation of the cost/quality ratio 
for every suitable (paid) time interval with a relatively short duration, e.g. seconds, 
minutes, hours. 
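The split of the instantaneous ratio into SCI and NCQR can be sketched in Python as follows. The function names, units and sample values are illustrative assumptions; the formulas follow (58) to (61) as reconstructed from the definitions in this section.

```python
def sci(sc_euro: float, dt: float) -> float:
    """System's Costs Intensity, (59): SC / delta-t [Euro per time unit]."""
    if dt <= 0.0:
        raise ValueError("the observation interval must be positive")
    return sc_euro / dt


def ncqr(q: float, paid_y: float) -> float:
    """Normalized Cost/Quality Ratio, (60): 1 / (Q * paid.Y)."""
    if not 0.0 < q <= 1.0:
        raise ValueError("Q must lie in the interval (0, 1]")
    if paid_y <= 0.0:
        raise ValueError("paid traffic intensity must be positive")
    return 1.0 / (q * paid_y)


def instantaneous_cost_quality(sc_euro: float, dt: float,
                               q: float, paid_y: float) -> float:
    """Instantaneous cost/quality ratio, (61): SCI * NCQR."""
    return sci(sc_euro, dt) * ncqr(q, paid_y)


# Hypothetical values: 7200 Euro over 24 hours, Q = 0.9, 40 Erlang paid traffic.
print(sci(7200.0, 24.0))                                   # Euro per hour
print(ncqr(0.9, 40.0))                                     # per Erlang
print(instantaneous_cost_quality(7200.0, 24.0, 0.9, 40.0))
```

Keeping the two factors separate mirrors the argument in the text: NCQR can be predicted from a traffic model alone, while SCI is supplied by the provider's accounting.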

The paid traffic intensity depends on the network traffic load. In any case, the instan- 
taneous values of the QoS indicator (Q) depend on many factors, including the network 
load. 

The expressions (57) and (58) are similar (the mean value of the instantaneous 
indicator, in Δt, gives the value of the Mean Cost/Quality Ratio indicator in Δt), but the 
methods for their estimation and usage are different. 

The Instantaneous Cost/Quality Ratio may be useful for dynamic pricing policies, 
depending on the network load. Related works on this subject were not found in the 
literature. 


5.3 Prediction of Instantaneous Cost/Quality Ratio 


An advantage of the Normalized Cost/Quality Ratio (NCQR) is its independence of the 
absolute amount of the system’s costs. This allows the estimations of NCQR and of the 
System’s Costs Intensity (SCI) to be separated. In this subsection, we estimate NCQR 
using the telecommunication system model described in the Chapter “Conceptual and 
Analytical Models for Predicting the Quality of Service of Overall Telecommunication 
Systems” of this book. 

We consider Q as an overall telecommunication system’s QoS indicator. Each of the 
described indicators on the overall system level (cf. Subsect. 4.2.3) may be used. 
The numerical examples below are for the indicator Q = Ebu, cf. (53). This corresponds 
to cases of fully successful communication, from the users’ point of view, regarding all 
call attempts offered to the network. Sometimes it is called “Network Call Efficiency”. 




As the paid traffic, the successful communication (carried) traffic is used: 

$$NCQR = \frac{1}{Q \; \mathit{paid.Y}} = \frac{1}{\mathit{Ebu} \; \mathit{paid.Y}} \tag{62}$$


The input parameter values describing the human behavior and the technical system, 
used in the model to obtain the presented numerical results, are typical of 
voice-oriented networks. 

Figures 10 and 11 present numerical results for the entire theoretical network traffic 
load interval, i.e. the terminal traffic of all A- and B-terminals (Yab) is within the range 
of 0% to 100% of the number Nab of all active terminals in the system. The input 
parameters are the same in both cases, except for the capacity of the network (the 
number of equivalent connection lines), given as a percentage of all terminals in the 
system. Differences in the network capacity cause different blocking probabilities due 
to resource insufficiency. Two cases have been considered: 


• Case 1: The network capacity equals 10% of all terminals present in the system; 
• Case 2: The network capacity equals 25% of all terminals present in the system. 

Fig. 10. Numerical prediction of the Normalized Cost/Quality Ratio (NCQR), Network Call 
Efficiency (Ebu), and Paid Traffic Intensity in an overall telecommunication system with QoS 
guarantees (Case 1: Network capacity = 10% of terminals). 

Fig. 11. Numerical prediction of the Normalized Cost/Quality Ratio (NCQR), Network Call 
Efficiency (Ebu), and Paid Traffic Intensity in an overall telecommunication system with QoS 
guarantees (Case 2: Network capacity = 25% of terminals). 


The results show that the Normalized Cost/Quality Ratio (NCQR) is considerably 
sensitive to the network capacity and traffic load. 


6 Conclusion 


In this chapter, a more detailed and precise approach to the traffic characterization has 
been taken, which allows the definition of new efficiency indicators on different levels, 
starting with the service phases and stages, continuing with (part of) the network, and 
finishing with the overall telecommunication system. The proposed Instantaneous 
Cost/Quality Ratio may be used for the establishment and utilization of dynamic pricing 
policies, depending on the network load. The use of the instantaneous cost/quality ratios 
as a source for QoE prediction is a very promising direction of research. Using similar 
approaches and specific QoS indicators for other types of networks, e.g. multimedia 
and multiservice networks, also seems very topical. 


Abstract. Cloud gaming is an emerging technology that combines cloud 
computing with computer games. Compared to traditional gaming, its core 
advantages include ease of development/deployment for developers, and lower 
technology costs for users given the potential to play on thin client devices. In 
this chapter, we first describe the approach, and then focus on the impact of 
latency, known as lag, on Quality of Experience for so-called First Person 
Shooter games. We outline our approach to lag compensation whereby we 
equalise, within reason, the uplink and downlink delays in real time for all players. 
We describe the testbed in detail: the open-source Gaming Anywhere platform, 
the use of NTP to synchronise time, the network emulator, and the role of the 
centralized log server. We then present results that validate the mechanism, and 
use small-scale, preliminary subjective tests to assess its performance. We 
conclude the chapter by outlining ongoing and future work. 


Keywords: Cloud gaming · Quality of experience · Network delay · Lag compensation 


1 Introduction 


The gaming industry plays an important role in the entertainment and software 
industries. According to “Video Game Revenue Forecast: 2017-22”, it is expected that 
the global market of video games will grow up to $174 billion by 2022 [1]. 
Traditionally, computer games are downloaded from the Internet and installed on a 
PC or other end user device, allowing players to run the corresponding game. With 
game sizes running into multiple gigabytes, the installation process may take on the 
order of hours, with perhaps additional time required to install patches of new game 
versions. Furthermore, when players wish to play newly released games, they may 
require a higher specified hardware configuration to enable all the visual effects, and so 
they have to upgrade their computers to meet the particular specification. Both of these 
factors can result in frustration and may result in gamers giving up the game [2]. 


© The Author(s) 2018 
I. Ganchev et al. (Eds.): Autonomous Control for a Reliable Internet of Services, LNCS 10768, pp. 104-127, 2018. 
https://doi.org/10.1007/978-3-319-90415-3_5 

Unlike conventional computer games, cloud gaming follows a different paradigm. The 
‘heavy lifting’ of game processing is done by servers in the cloud [3]. Game actions are 
captured by game clients and sent to the cloud server(s). The resulting game scenes are 
rendered by the cloud servers, and the audio and video frames are streamed back to 
clients over the broadband network. Gamers thus interact with and control games 
through thin clients, the thin client being a lightweight process (often a browser) which 
interacts with the remote server [2]. Figure 1 shows the relationship between the server 
and client in a cloud gaming service, with gamer actions captured and sent to the 
cloud-based gaming provider, which then streams a video back to the client. For this 
reason, cloud gaming allows gamers to play games with simple devices (also referred 
to as thin client devices) without having to install the games or to continuously upgrade 
computer hardware. 


[Figure: cloud servers send a rendered video stream (Mbps) to thin clients; thin clients 
send player actions (bps) to the cloud servers.] 

Fig. 1. Cloud gaming service 


For these reasons, game developers and users/gamers are paying more attention to 
cloud gaming systems [3]. From the developer perspective, the benefits include the 
potential of reaching out to more gamers, easier testing of ideas to improve cloud 
gaming systems, and avoiding piracy, since games are not downloaded to client 
devices [4]. 

From the gamer perspective, the benefits include access to games anytime (on 
demand), without the need to download games, reduced costs due to the fact that the 
computer hardware does not need to be upgraded frequently, and ability to play games 
on different platforms, such as PC, smartphone, tablet and so on. 

Although cloud gaming opens up a new direction for the video games industry, it is 
not without its challenges [2]. According to previous research, not only the bandwidth 
but also the CPU has a significant influence on cloud gamers’ Quality of Experience 
(QoE) [5]. Ideally, gamers would like both high-quality video and, where games are 
delay-sensitive, low latency. Latency, also known as lag in gaming, is especially 
important in the First Person Shooter (FPS) game genre. However, high-quality video, 
for example 720p/1080p at 50 fps, can make cloud gaming systems vulnerable to high 
network latency [3], as much more network capacity is needed than in the case of 
conventional video games (e.g., 5000 kb/s vs. 50 kb/s) [3]. To meet the needs of cloud 
gamers, network service providers thus have to take network latency, efficiency, high 
video quality, and error resiliency into consideration [6]. These factors represent 
significant challenges to the roll-out of large-scale cloud gaming services. Without 
adequate infrastructure that meets the specific needs of cloud gaming, the potential and 
benefits will not be realised. 

In this chapter, we focus on the impact of lag on QoE for so-called FPS games. In 
multi-player scenarios, different lag values experienced among players can lead to 
unfair game play and frustration among players [7]. In [8], the authors report findings 
that state that different QoS leads to unfairness or imbalanced games when there are no 
mechanisms for mitigating the QoS differences. Previous research has analysed the 
potential of achieving fairness in multi-player networked games through automated 
latency balancing [9]. We further tackle these challenges in the context of cloud 
gaming. We describe in detail our approach to lag compensation whereby we equalise 
within reason the up and downlink delays in real-time for all players, aiming to achieve 
fairness among players and consequently improve QoE. 

At present, several cloud service providers have developed cloud gaming platforms, 
most of which are closed source (e.g., Sony PlayStation Now, NVIDIA GeForce Now). 
Therefore, game developers cannot test and fine-tune their games on them [3] and they 
have to do the tests and fine-tuning of the games on emulators. This fact increases the 
difficulty of improving cloud gaming systems to better reflect gamers’ needs and 
expectations. 

In response to this, Gaming Anywhere (GA) was designed and developed. It is the 
first open-source cloud gaming platform that allows researchers to quickly explore their 
ideas. More importantly, GA has greatly promoted the evolution of cloud gaming 
within the video game industry [3]. 

The remainder of this chapter is structured as follows. Section 2 provides a liter- 
ature survey on gaming QoE. It firstly deals with the impact of latency and packet loss 
on game QoE, before focusing specifically on latency. It then examines the impact of 
delay for different game genres. Section 3 introduces the concept of lag compensation 
and outlines our two research objectives. Section 4 outlines in detail how lag equali- 
zation was implemented on the Gaming Anywhere platform in order to address both 
research objectives. Section 5 presents results that firstly validate the lag compensation 
mechanism and then outlines the results of preliminary subjective tests. Section 6 
concludes the chapter by outlining ongoing and future work. 


2 Literature Review 


A key research challenge has been to determine the impact of a wide range of influence 
factors on gaming QoE, including a wide range of human, system, and context factors 
[10-12]. Focusing on system influence factors, and as outlined in the previous section, 
cloud gaming demands a high level of network Quality of Service (QoS) to deliver 
acceptable user perceived quality (QoE) to players. Key QoS-related factors include 
packet loss and delay, with their impact on QoE differing for different types (genres) of 
games. 

In this section, we review relevant literature that outlines the impact that both 
roundtrip delay (latency), also known as Response Delay (RD) or lag, and packet loss 
have on the end user experience in general, and also for different game genres. As the 
main focus of this chapter is on lag compensation, we focus more on lag in the 
literature review. 

Given the inherently interactive nature of gaming in general, a key challenge is 
meeting delay requirements. This involves the delivery of both user control inputs 
(mouse, keyboard strokes) to the game server and uninterrupted presentation of con- 
tinuous game content to players, transmitted in the form of a video stream. For this 
reason, conventional methods for diminishing the effects of poor network conditions 
and consequent jitter on streaming media, such as buffering data for display, are not 
readily applicable in the context of cloud gaming. Moreover, lag compensation tech- 
niques applicable in “traditional” gaming, such as client-side prediction [13], are not 
applicable in the context of cloud gaming, where the client is simply decoding and 
portraying the stream received from the server. Consequently, numerous studies have 
addressed the impact of latency due to heterogeneous and variable network conditions 
on the end user QoE of cloud gaming. 

As reported in [5, 14-16], packet loss and delay have a significant impact on cloud 
gaming QoE. Basically, network congestion results in network delay jitter and, when 
the queue size is exceeded, packet loss occurs. Delay and jitter affect both the uplink 
transmission of the player’s input events to the server and the downlink transmission of 
the game scenes that are eventually displayed on the screen. Moreover, as shown 
in [17], a high network delay disrupts the interaction between server and players 
and negatively influences players’ QoE. 

It is important to note that not all cloud games are equally sensitive to latency, as is 
of course also the case for “traditional” networked games [18]. For Real-time Strategy 
(RTS) games, the process of constructing buildings or moving troops towards a bat- 
tlefield is unaffected by latency as high as 1000 ms [15]. However, First Person 
Shooter (FPS) games, where users are shooting at a moving target tend to be more 
sensitive to latency with delays of over 100 ms seen as unacceptable [19]. Moreover, 
the effects of latency are based on two action properties: precision and deadline. 
Precision refers to the accuracy of actions, whereas deadline refers to the timeliness of 
events. Games with higher precision and tighter deadlines are more sensitive to latency. 
For this reason, FPS players always emphasise precision and deadline [20]. 

As reported above, latency plays a very important role when it comes to 
cloud gaming. Despite this fact, there is no work, to the best of our knowledge, dealing 
specifically with the impact of delay compensation on QoE in the context of cloud 
gaming. Therefore, we have decided to focus on this issue in this chapter. More 
specifically, we showcase the impact of lag compensation on QoE in a case study 
involving an FPS cloud game. 


3 Lag Compensation 


Figure 2 shows the relationship between the server and the client for cloud gaming. 
The client sends control events to the server over the network, then the server 
samples/executes the input commands and delivers a stream (Audio/Video) back to the 
client. Finally, the client receives and decodes the stream to be portrayed on the screen. 
This round-trip delay is also known as response delay. 


[Figure: the response delay between the SERVER and the CLIENT.] 

Fig. 2. The relationship between the server and the client 


Basically, lag compensation is a technique that attempts to equalise lag for all 
players in a cloud gaming scenario. For example, in Fig. 3, there are two players (P1 
and P2) that are playing an FPS cloud game. 


[Figure: a server in Dublin, with Player1 in London and Player2, each experiencing a 
different response delay.] 

Fig. 3. Two players with different RD 

The game server is located in Dublin, Ireland. P1 plays in London with an average 
lag of, say, 100 ms (equal delay in both directions), while P2 is in Galway, Ireland with 
an average lag of, say, 10 ms, again with equal delay in both directions. Since P1 has a 
longer RD than P2, P1 will have a relatively bad game experience, as P2 has an inherent 
advantage. 

To analyse and visualise this lag difference further, consider Fig. 4, where the 
player in Galway (P2) has moved from Position A to Position B. 


Fig. 4. Example of lag in an FPS game (taken from 
https://developer.valvesoftware.com/wiki/Source_Multiplayer_Networking) (Color figure online) 


The red hitbox shows the A position where P2 was prior to moving. However, due 
to longer RD for P1, his view of the game still shows P2 at position A. When P1 
executes a shoot action, the gameplay information is sent to the server. When this 
command arrives at the server, the target P2 has already moved to position B. As a 
result, P1 misses P2 even though P1 correctly aimed at the opponent in his view of the 
game. To eliminate this issue, lag compensation is needed on the server side: an 
artificial delay is added to P2 so that both P1 and P2 experience the same lag on both 
uplink and downlink traffic, and the game thus becomes fairer. Figure 5 shows the game 
after lag compensation. 


[Figure: the server in Dublin with equal response delays towards both players.] 

Fig. 5. Game after lag compensation 


110 Z. Li et al. 


As shown, equal delays are added to both the uplink (client to server or c2s) and 
downlink (server to client or s2c) traffic. The assumption of equal delays in both 
directions is in reality rarely true, largely due to asymmetry in traffic flows. 
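The equalisation idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used on the testbed: the function name is hypothetical, one-way delays are assumed to be known per player and per direction, and every player is padded up to the largest delay observed in each direction.

```python
from typing import Dict, Tuple

# Per-player one-way delays in milliseconds: (uplink c2s, downlink s2c).
Delays = Dict[str, Tuple[float, float]]


def equalising_delays(measured: Delays) -> Delays:
    """Return the artificial delay to add per player and direction.

    Each direction is padded up to the largest delay observed in that
    direction, so all players end up with the same c2s and s2c delay.
    """
    if not measured:
        return {}
    max_c2s = max(c2s for c2s, _ in measured.values())
    max_s2c = max(s2c for _, s2c in measured.values())
    return {
        player: (max_c2s - c2s, max_s2c - s2c)
        for player, (c2s, s2c) in measured.items()
    }


# Example from the text: P1 has a 100 ms lag and P2 a 10 ms lag, each split
# equally between the two directions (50 ms and 5 ms one way).
measured = {"P1": (50.0, 50.0), "P2": (5.0, 5.0)}
print(equalising_delays(measured))
```

Handling the two directions separately means the same function also covers the asymmetric case mentioned above, where uplink and downlink delays differ.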


3.1 Research Objectives 


Having given a brief overview of literature relating to the QoE of gaming, and intro- 
duced the proposed lag equalization concept, we define two key research objectives 
that are the focus for the remainder of the chapter: 


1. How feasible is it to implement a real-time lag compensation strategy for cloud 
gaming? 
2. Will the lag compensation approach result in improved QoE for FPS gamers? 


4 Implementation 


In this section, we describe the implementation details related to the testbed used to 
address both research objectives. We firstly describe the cloud gaming platform 
Gaming Anywhere that was used, followed by the game that was chosen for the 
platform. Other testbed requirements and tools such as time synchronization and net- 
work emulator are also briefly described. 


4.1 Infrastructure 


As previously mentioned, GA is designed to better bridge the computer game industry
and the research community. Its most attractive feature is its openness, with GA being
the first open-source cloud gaming platform. Unlike other existing systems, GA allows
developers and researchers to explore their ideas on a real testbed and to extend the
current system. As defined previously, the response delay (RD) is the time between the
GA client sending input events to the GA server and the corresponding game scenes being
displayed on the player’s screen. The RD is composed of three components [21]:


• Processing delay (PD): it represents the time between when the server receives the
  control events and sends the encoded frames to the client.

• Playout delay (OD): it is the time to decode and render the decoded frames on the
  screen on the client side.

• Network delay (ND): it is the time required for a round trip data exchange between
  the client and the server.
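
The decomposition above amounts to RD = PD + OD + ND. A minimal Python sketch of this sum follows; the class name, field names and the numeric values are illustrative assumptions and are not measurements from the testbed.

```python
from dataclasses import dataclass


@dataclass
class ResponseDelay:
    """Response delay components in milliseconds (hypothetical container)."""
    processing_ms: float  # PD: server receives the input until it sends the encoded frame
    playout_ms: float     # OD: client decodes and renders the frame
    network_ms: float     # ND: round trip between client and server

    def total_ms(self) -> float:
        # RD is the sum of the three components.
        return self.processing_ms + self.playout_ms + self.network_ms


# Assumed values, chosen only to illustrate the sum.
example = ResponseDelay(processing_ms=20.0, playout_ms=10.0, network_ms=40.0)
print(example.total_ms())
```

Of the three components, only ND depends on the network path, which is why the lag equalization described in this chapter acts on the network delay.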


With the platform chosen, the next step was to choose a suitable game with which 
to test/validate our approach. As outlined in the Literature Review, an FPS game was 
most suited as these have the tightest lag constraints, i.e., are more sensitive to lag than 
other game genres. For this reason, the game Assault Cube was chosen.


4.2 Game Setup


To test the basic operation of the approach, a two-player setup was chosen. To ensure
that players are in the same game but have different game views, two instances of the
GA server are needed, one per player. This means that each player (client) has its own
server. One of these servers becomes the master game server and the other a slave
server. For this reason, four computers were needed to set up the experimental
environment. For a more scalable implementation, a VM approach is required to run all
GA servers, as described later. Since all machines are in the same university lab and
each of them connects to the network via a switched IEEE 802.3u Fast Ethernet
(100 Mbps) network, network delays are minimal. Figure 6 shows the experimental
environment.


[Figure: Server1 (AC host) and Server2 (AC client) linked by a connection; Client1
connected to Server1 and Client2 connected to Server2]


Fig. 6. The experimental environment


4.3 Assault Cube Configuration 


Assault Cube (AC) is an FPS game which is based on the CUBE engine and is available
for free on Windows, Linux and OS X. It supports single-player and multi-player game
modes. AC is launched on Server1 first, and this becomes the master/host for a
multi-player game scenario. AC is then launched on Server2, by selecting multi-player
mode and joining the game created by Server1 as a slave. Figure 6 illustrates the
connection between the two servers and also between the two GA clients/players and
their respective GA servers. As shown in Figs. 1 and 2, the actual data flow is
bidirectional. Once set up, each GA client sends its game actions to its server and
receives the video feed from its own server.


4.4 Time Synchronization


Network Time Protocol (NTP) is designed to synchronise the system time of computers
across IP networks to Coordinated Universal Time (UTC), achieving millisecond
(ms) level synchronisation or better across well provisioned wired LANs and single ms
level synchronisation across well configured WANs. In this experiment, NTP plays a key
role in synchronising time across the GA servers and GA clients. This then facilitates
accurate delay measurements as outlined in the next section. In our university LAN, the
server to client (s2c) and client to server (c2s) delays are minimal, with the RTT
measured via the ping utility being of the order of a few ms or less. Due to NTP
tolerances when running on the Windows platform and the resulting clock offsets, the
s2c and c2s delays are occasionally determined as less than 0 or greater than the RTT.

For our testbed, Server1 was set as the reference clock (NTP server mode), with the
other GA server and both GA clients set up as NTP clients. With this configuration,
shown in Fig. 7, time offsets of around 1 ms were typical, with occasional fluctuations.
Although the synchronisation performance of NTP on the Windows platform is not as
good as on Unix-type platforms, the levels of synchronisation achieved (1-2 ms) are
sufficient for the case where we are looking at network delays of 100 ms and more.


[Figure: NTP configuration with Server1 as the reference clock]


Fig. 7. NTP setup


4.5 Each Way Delay Measurement


The GA platform utilizes the Live555 Real-time Transport Protocol (RTP) library to
transport audio/video from server to client. RTP and its companion control protocol
RTCP are very widely used in VoIP and conferencing software such as WebRTC and the
Facebook and WhatsApp voice clients. For our purposes, the RTCP library source code
was modified to enable the calculation of each way delay. By default, RTCP traffic,
which runs in parallel to the media RTP flows for both audio and video, enables the
calculation of the round-trip delay, i.e., RTT minus residency time on the remote host
(described above as Network Delay in [21]). Adding code that also returns the local
timestamp when the RTCP receiver report packet is sent back facilitates the calculation
of both upstream and downstream delay for each stream, provided that NTP is correctly
implemented.
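
The each way calculation described above can be sketched as follows. This is a simplified illustration and not the modified Live555 code: timestamps are plain milliseconds rather than RTCP's fixed-point NTP format, and the function and parameter names are hypothetical. The second call shows how a client clock offset (an assumed 4 ms here) produces a negative one-way delay and another one larger than the round-trip delay, as noted in Sect. 4.4.

```python
def each_way_delays(sr_sent_ms, hold_ms, rr_sent_ms, rr_arrival_ms):
    """Return (s2c, c2s, round_trip) in ms.

    sr_sent_ms    -- server clock when the sender report (SR) left the server
    hold_ms       -- time the client held the SR before replying (residency time)
    rr_sent_ms    -- client clock when the receiver report (RR) was sent;
                     this is the extra timestamp returned to the server
    rr_arrival_ms -- server clock when the RR arrived
    """
    sr_arrival_ms = rr_sent_ms - hold_ms           # client clock at SR arrival
    s2c = sr_arrival_ms - sr_sent_ms               # downstream, needs synchronised clocks
    c2s = rr_arrival_ms - rr_sent_ms               # upstream, needs synchronised clocks
    round_trip = rr_arrival_ms - sr_sent_ms - hold_ms  # server clock only
    return s2c, c2s, round_trip


# Perfectly synchronised clocks: true delays of 3 ms down and 2 ms up.
synced = each_way_delays(sr_sent_ms=1000, hold_ms=50, rr_sent_ms=1053, rr_arrival_ms=1055)

# Same exchange, but the client clock runs 4 ms behind the server clock.
skewed = each_way_delays(sr_sent_ms=1000, hold_ms=50, rr_sent_ms=1049, rr_arrival_ms=1055)

print(synced)
print(skewed)
```

In both calls the round-trip value is identical because it uses only the server clock; the clock offset merely shifts delay from one direction to the other.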


4.6 Network Emulator 


In a real Cloud Gaming environment, the players are often at different geographic 
locations with different network latency, resulting in an unfair game for the player with 
longer RD. In order to emulate this scenario and thus test the implementation of lag 
equalization, Network Emulator for Windows Toolkit (NEWT) (available from
https://blogs.technet.microsoft.com/juanand/2010/03/05/standalone-network-emulator-tool/)
was used on the server side to emulate different network environments for both uplink
(c2s) and downlink (s2c) traffic for each player.


4.7 Centralized Log Server 


Since multiple GA server instances are required (one per player), a centralized log
server is required to collect delay data in real-time from each GA server, perform QoE
analysis, and then transmit the required up and downlink lag compensation delays back
to each corresponding GA server. The data sent from each server to the centralized log
server includes the synchronization source (SSRC), server to client (s2c) delay, client
to server (c2s) delay, IP address, and the downstream and upstream port numbers.
Figure 8 illustrates the data generated and gathered by each GA server into a char
array, and sent to the centralized log server in real-time.


Data: SSRC serverToClientDelay clientToServerDelay IP portForDownstream portForUpstream 


Fig. 8. Data structure in GA server 
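
A possible encoding of the record in Fig. 8 is sketched below. The field order follows the figure, but the space-separated ASCII layout, the type and function names, and the sample values are assumptions made for illustration; the chapter does not specify the exact byte format.

```python
from typing import NamedTuple


class DelayReport(NamedTuple):
    """One record per stream, with fields in the order shown in Fig. 8."""
    ssrc: str
    s2c_ms: int
    c2s_ms: int
    ip: str
    port_down: int
    port_up: int


def encode(report: DelayReport) -> bytes:
    # Assumed wire format: fields joined by single spaces, ASCII encoded.
    return " ".join(str(field) for field in report).encode("ascii")


def decode(payload: bytes) -> DelayReport:
    ssrc, s2c, c2s, ip, port_down, port_up = payload.decode("ascii").split()
    return DelayReport(ssrc, int(s2c), int(c2s), ip, int(port_down), int(port_up))


# Hypothetical sample: a video stream with 7 ms downstream and 4 ms upstream delay.
sample = DelayReport("v1001", 7, 4, "192.0.2.10", 5004, 5006)
wire = encode(sample)
print(wire)
print(decode(wire))
```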


Figure 9 illustrates the centralized log server within the whole architecture. Once it 
receives data, it will compare s2c delay and c2s delay between different GA servers, 
and determine what compensating delay to add on upstream (player control actions) 
and downstream (video/audio) in real-time for each player. 

Fig. 9. Centralized log server


UDP sockets are used instead of TCP sockets to minimize delay and maximize the
responsiveness of the system: UDP does not need any connection setup such as a
three-way handshake, and it ignores lost packets, which would otherwise incur more
delay when sending data. UDP sockets are thus an appealing choice for non-critical
delay-sensitive applications.
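
The connectionless exchange described above can be illustrated with Python's standard socket module. This is a loopback sketch, not the testbed code; the helper names and the payload are hypothetical.

```python
import socket


def send_report(payload: bytes, address: tuple) -> None:
    """Send one datagram: no connection setup and no retransmission."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)


def receive_report(sock: socket.socket, max_bytes: int = 1024) -> bytes:
    """Wait for one datagram (or until the socket timeout expires)."""
    data, _sender = sock.recvfrom(max_bytes)
    return data


# Loopback demonstration: the "log server" listens on a port chosen by the OS.
log_server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
log_server.bind(("127.0.0.1", 0))
log_server.settimeout(1.0)

send_report(b"v1001 7 4 192.0.2.10 5004 5006", log_server.getsockname())
received = receive_report(log_server)
log_server.close()
print(received.decode("ascii"))
```

Because a lost report is simply superseded by the next one, the absence of retransmission is acceptable for this kind of periodic measurement data.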

Since the data contains both characters and integers, we use an object array to store
the data together. Figure 10 illustrates the data structure in the centralized log
server. Once the centralized log server receives data from more than one player, it
starts the analysis and populates the table shown in Fig. 10. Firstly, from incoming
data packets the centralized log server determines the player with the largest s2c
delay for audio and video respectively. Based on upstream delays for RTCP RR traffic,
it also calculates an average (of video/audio) c2s delay. As outlined earlier, the
threshold of round-trip delay for FPS games is 100 ms. For this reason, we implement a
threshold such that whenever the round-trip delay of a player is greater than 100 ms,
the data for that player is not considered, as it makes no sense to penalise every
player with more than 100 ms round-trip time after lag compensation.


        | SSRC | s2c delay | Added downstream delay | c2s delay | Avg c2s delay | Added upstream delay | IP | Downstream port | Upstream port
Data[0] |
Data[1] |
Data[2] |


Fig. 10. Data structure in centralized log server

As shown in Fig. 11, Client1 and Client2 have round-trip times of 30 ms and 50 ms
respectively, while Client3 has 120 ms. Since the threshold is 100 ms, the centralized
log server only compares the data for Client1 and Client2.


Fig. 11. Threshold network latency for FPS games


Once determined, the centralized log server sends the recommended compensation
delay data back to each GA server based on IP address. The respective GA servers then
introduce these additional delays on upstream control traffic (player actions) and
downstream data (audio/video).
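
The threshold and comparison logic described above can be sketched as follows. This is a simplified illustration, not the log server implementation: the function name is hypothetical, the round-trip delay is taken as s2c + c2s, and the 30/50/120 ms round-trip times of Fig. 11 are assumed to split evenly between the two directions.

```python
THRESHOLD_MS = 100  # round-trip threshold for FPS games used in the text


def compensation(delays, threshold_ms=THRESHOLD_MS):
    """Map {player: (s2c_ms, c2s_ms)} to {player: (added_down_ms, added_up_ms)}.

    Players whose round-trip delay exceeds the threshold are left out of the
    comparison and therefore do not appear in the result.
    """
    eligible = {
        player: (s2c, c2s)
        for player, (s2c, c2s) in delays.items()
        if s2c + c2s <= threshold_ms
    }
    if len(eligible) < 2:
        # Nothing to equalise with fewer than two eligible players.
        return {player: (0, 0) for player in eligible}
    max_s2c = max(s2c for s2c, _ in eligible.values())
    max_c2s = max(c2s for _, c2s in eligible.values())
    return {
        player: (max_s2c - s2c, max_c2s - c2s)
        for player, (s2c, c2s) in eligible.items()
    }


# Round-trip times of 30, 50 and 120 ms, assumed symmetric.
fig11 = {"Client1": (15, 15), "Client2": (25, 25), "Client3": (60, 60)}
added = compensation(fig11)
print(added)
```

Under these assumptions Client1 is delayed by a further 10 ms in each direction, Client2 is left unchanged, and Client3 is excluded from the comparison.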

Figure 12 shows the full structure of the testbed in this experiment and also outlines
the data flows as well as the role of the network emulator and lag compensation.


[Figure: testbed diagram; legend: Audio, Video, Control Event, Analyzed Data, Add
Delay, Emulate Delay; the Centralized Log Server (QoE) exchanges analyzed data with the
GA servers; delay labels of 10 ms / 10 ms are shown]


Fig. 12. Full system architecture and data flows


As shown above, we introduce emulated delays of 20 (10/10) and 40 (20/20) ms to
Client1 and Client2 respectively. This results in lag compensation of 10 ms on both up
and downlink traffic for Client1 (shown in grey above), so that c2s and s2c are equal
for both clients. Note that the delays shown here are symmetric, which is rarely the
case in reality. Further details on setting up the testbed and background work can be
found in [22].


5 Results


In this section, we outline results that address both research objectives from Sect. 3.1. 
We firstly present a range of results for the 2-player scenario in order to validate the 
performance of the lag compensation mechanism. We present baseline delays with no 
emulated delays to get a sense for delays as measured across the university LAN 
network and the possible need for lag compensation to cope with inherent delay 
asymmetries. We then outline a range of tests whereby we introduce both up and 
downlink delays for different players using the network emulator and see how the lag 
compensation mechanism performs. Moving to objective 2, we then present results of 
preliminary subjective tests that exposed players to a range of delays with and without 
lag compensation and captured the resulting QoE scores. The section concludes with a 
discussion of ongoing and future work. 


5.1 Baseline Results 


Figure 13 shows the analysed data processed by the log server from the two 
servers/players deployed across the LAN network environment with no artificial delays 
added by the network emulator. Player 1 data is shown in the first 2 data rows, Data 1 
for video and Data 2 for audio. Player 2 data is shown in the next 2 rows, Data 3 for 
video and Data 4 for audio. As outlined above, the columns (left to right) show SSRC 
(v for video and a for audio), s2c delay, delay added to downstream (s2c) traffic, c2s 
delay, average c2s delay (of video and audio stream), delay added to upstream traffic 
(c2s), IP address and 2 ports. The average c2s is calculated (from audio and video 
RTCP traffic) and used to delay the uplink control traffic. 


[Figure: table of analysed data, annotated with the columns for delay added on
downstream and delay added on upstream]


Fig. 13. Analysed data with no emulated delays


The output shows how the lag compensating mechanism achieves the goal of
equalising the delays for both players. For example, the measured s2c delays for
Player 1 are 7/5 ms for v/a respectively, whereas Player 2 has 6/1 ms for v/a
respectively. Therefore, in order to equalise this, compensating s2c delays of 1/4 ms
for v/a are added for Player 2, so that the totals for Player 2 are 7 (6 + 1) and
5 (1 + 4) ms, i.e. the same as Player 1.

Moving to c2s delays, Player 1 has measured c2s delays of 4/4 ms for v/a
respectively, whereas Player 2 has 6/6 ms for v/a respectively. Therefore, in order to
equalise this, a c2s delay of 2 ms is added for Player 1, so that the total for
Player 1 is 6 (4 + 2) ms, i.e. the same as Player 2.
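
The baseline figures above can be reproduced with a short per-stream calculation. The function name and data layout are hypothetical, and the use of an integer (floor) average for the c2s delay is an assumption; only the measured delays are taken from the text.

```python
STREAMS = ("v", "a")  # video and audio


def equalize(players):
    """Compute added delays per player.

    players: {name: {"s2c": {"v": ms, "a": ms}, "c2s": {"v": ms, "a": ms}}}
    Downstream is equalised per stream; upstream uses the average c2s delay of
    the video and audio streams, since it is applied to the control traffic.
    """
    max_s2c = {s: max(p["s2c"][s] for p in players.values()) for s in STREAMS}
    avg_c2s = {
        name: sum(p["c2s"][s] for s in STREAMS) // len(STREAMS)  # assumed integer average
        for name, p in players.items()
    }
    max_c2s = max(avg_c2s.values())
    return {
        name: {
            "add_s2c": {s: max_s2c[s] - p["s2c"][s] for s in STREAMS},
            "add_c2s": max_c2s - avg_c2s[name],
        }
        for name, p in players.items()
    }


baseline = {
    "Player 1": {"s2c": {"v": 7, "a": 5}, "c2s": {"v": 4, "a": 4}},
    "Player 2": {"s2c": {"v": 6, "a": 1}, "c2s": {"v": 6, "a": 6}},
}
result = equalize(baseline)
for name, added in result.items():
    print(name, added)
```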


5.2 Emulated Delays 


In order to fully test the lag compensation mechanism, we then added delays using the
NEWT emulator and monitored both how quickly these delays were picked up by the
mechanism and how compensating delays were then added to equalise delays for both
players. We firstly added 10 ms to both the up and downlink delay for Player1/Client1
and 20 ms to both the up and downlink delay for Player2/Client2. As shown in Figs. 14
and 15, we used “ping” to validate the performance of the network emulator, which
returned averages of 19 and 39 ms respectively.


[Figure: ping output for Client1; the text is largely illegible in the source, the
legible part reports a minimum round-trip time of 19 ms]


Fig. 14. Validating network emulator in client1


    Pinging 192.168.9.125 with 32 bytes of data:

    Reply from 192.168.9.125: bytes=32 time=39ms TTL=128
    Reply from 192.168.9.125: bytes=32 time=39ms TTL=128
    Reply from 192.168.9.125: bytes=32 time=39ms TTL=128
    Reply from 192.168.9.125: bytes=32 time=39ms TTL=128

    Ping statistics for 192.168.9.125:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 39ms, Maximum = 39ms, Average = 39ms


Fig. 15. Validating network emulator in client2


Figure 16 shows the analysed data for two players under these emulated network
environments. It can be seen that the actual delays as measured by modified RTCP code 
are not 10/10 and 20/20 as implemented by the emulator. The delays are closer to those 
seen above under zero emulated delays plus the emulated delays plus additional noise 
caused by non-determinism in various application software and OS stack etc. This 
results for example in s2c delays for Player 1 of 15/19 ms v/a, c2s delays of Player 1 
15/12 v/a, and for Player 2, s2c v/a delays of 26/30 ms and c2s delays of 24/26 ms v/a. 


[Figure: table of analysed data, annotated with the columns for delay added on
downstream and delay added on upstream]


Fig. 16. Analysed data with different network environments


Based on these results, GA Server1 requires:


• an additional delay of 11 ms on the downstream (s2c) v/a streams, so that Player 1
  has a total of 26 ms (15 + 11) for video and 30 ms (19 + 11) for audio, the same as
  Player 2,
• an additional delay of 12 ms on upstream, so that the total is 25 ms, the same as
  Player 2.
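
The arithmetic behind the two compensation values above can be written out as follows. Only the measured delays are taken from the text; reading the reported 12 ms as the 11.5 ms difference expressed in whole milliseconds is an assumption, since the chapter does not state how the average is rounded.

```latex
\begin{align*}
\Delta_{\mathrm{s2c}}^{\mathrm{video}} &= 26 - 15 = 11~\text{ms}, &
\Delta_{\mathrm{s2c}}^{\mathrm{audio}} &= 30 - 19 = 11~\text{ms},\\
\bar{d}_{\mathrm{c2s}}^{\,P1} &= \tfrac{15 + 12}{2} = 13.5~\text{ms}, &
\bar{d}_{\mathrm{c2s}}^{\,P2} &= \tfrac{24 + 26}{2} = 25~\text{ms},\\
\Delta_{\mathrm{c2s}} &= 25 - 13.5 = 11.5~\text{ms} \approx 12~\text{ms}.
\end{align*}
```

If the average for Player 1 is instead truncated to 13 ms, the difference is exactly 12 ms and the total of 25 ms quoted above is reached.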


We then changed the network latency to 30 ms (15 ms up/downstream) and 60 ms
(30 ms up/downstream) for Player 1 and 2 respectively. The results are shown in Fig. 17.


Fig. 17. Analyzed data after changed network latency


As above, the log server detects the changed network conditions and communicates
the appropriate changes to respective streams back to GA servers in order to equalise
delays.

Figure 18 shows the full set of emulated delays implemented for 6 different tests
and Fig. 19 illustrates the resulting analysed data.


[Figure 18: table of emulated delays for the 6 tests; only the column header “P1 Down”
survives in the source]


Fig. 19. Analysed data for tests (Index 0-5) with different network latency


The data (left to right) has two additional columns outlining the total s2c and c2s
delays (measured + compensation delays). The columns are:


• SSRC,
• s2c delay,
• delay added to downstream (s2c),
• total s2c delay after lag equalization,
• c2s delay,
• avg c2s delay,
• delay added to upstream (c2s),
• total c2s delay after lag equalization, and
• IP address.


In each case, the mechanism is seen to work correctly by firstly detecting the impact 
of the emulated delays (including noise) typically within one second and then imple- 
menting lag compensation so that total c2s and s2c delays for player 1 and 2 are equal. 


5.3 Preliminary Subjective Testing 


Although the initial plan was to set up all GA servers in virtual machines for
subjective testing, some technical problems and limitations arose, and thus the tests
were carried out using separate servers. Since Assault Cube supports a maximum of
5 players in a LAN, we used 10 computers (5 as GA servers and 5 as GA clients) to run
the game. As shown in Fig. 20, a multiplayer game was created on Server1 and the others
joined the game created by Server1. The GA connection between GA servers and GA
clients was then established. NTP was also configured on both GA servers and clients to
synchronise system time. Finally, the network emulator was used to introduce artificial
delay to emulate different network environments.


Fig. 20. Testbed


Test Scenarios

Ten people were divided into two groups (5/5) to perform the test. All of the
participants were postgraduate students and were familiar to varying degrees with
gaming. As shown in Fig. 21, each group (Player 1 to Player 5) firstly played the game
without any emulated network delay, thus providing a baseline for the tests. A series
of delay scenarios (10 in total, given as uplink/downlink in ms) was then introduced
for each player, both with and without lag compensation, with each scenario lasting
3 min. After each scenario, players were given a small amount of time to fill out a
questionnaire and report the overall QoE and game fairness on a scale of 1-5. Both
groups underwent the same series of scenarios.


P1    | P2    | P3    | P4    | P5    | Lag Compensation
0/0   | 0/0   | 0/0   | 0/0   | 0/0   | No
40/40 | 0/0   | 10/10 | 20/20 | 30/30 | No
40/40 | 0/0   | 10/10 | 20/20 | 30/30 | Yes
0/0   | 10/10 | 20/20 | 30/30 | 40/40 | No
0/0   | 10/10 | 20/20 | 30/30 | 40/40 | Yes
10/10 | 20/20 | 30/30 | 40/40 | 0/0   | No
10/10 | 20/20 | 30/30 | 40/40 | 0/0   | Yes
20/20 | 30/30 | 40/40 | 0/0   | 10/10 | No
20/20 | 30/30 | 40/40 | 0/0   | 10/10 | Yes
30/30 | 40/40 | 0/0   | 10/10 | 20/20 | No
30/30 | 40/40 | 0/0   | 10/10 | 20/20 | Yes


Fig. 21. Test scenarios 


Test Results 

The overall QoE in the absence of lag equalization reported by participants is shown in
Fig. 22. Whilst the sample size is small, the bar chart clearly shows that the overall
QoE decreased with increasing delay, which is in line with other studies on FPS games.
Figure 23 illustrates the perceived game fairness for different emulated delay scenarios
with and without lag compensation. Again, whilst preliminary and based on a small
sample size, the results clearly show that in the absence of lag compensation, higher
relative network latency results in lower game fairness. However, the game fairness
remains high once lag compensation is introduced. It is very interesting to note that,
with lag compensation, there was no decrease in QoE as emulated delays increased,
presumably as all participants were experiencing the same delays and values were less
than the 100 ms threshold that is reported as the threshold for acceptability in FPS
games.


Fig. 22. Average overall QoE reported for different test scenarios. Delays portrayed on
the x axis indicate RTTs.


Fig. 23. Perceived game fairness


More comprehensive and rigorous tests are planned to fully evaluate the effec- 
tiveness of our approach. More detailed results from the above preliminary tests are 
available in [22].


5.4 Next Step: Virtualization


The above tests were carried out using dedicated servers for each player GA instance. 
In order to scale up the testbed, we plan to run all GA server instances using Virtual 
Machines (VM). This will also help to eliminate some of the non-determinism seen 
above in the results. This architecture is shown in Fig. 24, whereby each GA server 
runs in a dedicated VM and sets up connections with each client. 


[Figure: GA server instances hosted in virtual machines, connected to Client1 to
Client5]


Fig. 24. Run GA server instances in a VM


6 Conclusions 


In this chapter, we examine cloud gaming from the delay (lag) perspective, and in 
particular the impact of lag on QoE of gamers. Cloud gaming is an emerging service, 
which combines cloud computing and online gaming. It opens a promising direction 
for the computer games industry but several challenges still remain to provide good 
QoE for every player. We review the literature and then analyze certain QoS-related 
key factors and characteristics, which influence the QoE for cloud gaming, especially 
for so-called FPS games. The conclusion is that for FPS games, network latency 
presents one of the most important QoE factors. In this context, we propose a lag 
equalization strategy to level the playing field in the context of QoE and outline two 
research objectives. The first objective is to examine the feasibility of implementing a 
real-time lag compensation mechanism for cloud gaming. To meet this objective,
we implemented a cloud gaming system with both up and downlink lag compensation
based on the Gaming Anywhere platform, an open platform for researchers. The FPS 
game Assault Cube was used to showcase the implementation. The mechanism uses a 
modified version of the RTCP protocol along with NTP to ensure adequate time 
synchronization to yield accurate each way delays. These are communicated to a 
centralized monitoring service that then determines and communicates back to each 
server the necessary up and downlink compensation delays. The lag compensation 
approach was evaluated in an emulated environment whereby a series of tests were 
carried out with differing uplink and downlink emulated delays to emulate differing 
network conditions. The results validate the mechanism by successfully implementing 
real-time delay equalization. To meet objective 2, we then carried out preliminary 
subjective tests with a small group of participants. Results firstly confirmed the impact 
of lag on QoE as detailed in the literature review and then validated our lag com- 
pensation approach whereby the reported QoE remained high for high delay values 
once equalization was implemented. Our future research will firstly optimize the 
experimental cloud gaming environment by introducing virtual machines for scala- 
bility. More comprehensive subjective QoE tests using the proposed lag compensation 
approach will then be undertaken to more rigorously evaluate its effectiveness.


Abstract. Video streaming has become an indispensable technology in
people’s lives, while its usage keeps constantly increasing. The variabil- 
ity, instability and unpredictability of network conditions pose one of the 
biggest challenges to video streaming. In this chapter, we analyze HTTP 
Adaptive Streaming, a technology that relieves these issues by adapting 
the video reproduction to the current network conditions. Particularly, 
we study how context awareness can be combined with the adaptive 
streaming logic to design a proactive client-based video streaming strat- 
egy. Our results show that such a context-aware strategy manages to 
successfully mitigate stallings in light of network connectivity problems, 
such as an outage. Moreover, we analyze the performance of this strategy 
by comparing it to the optimal case, as well as by considering situations 
where the awareness of the context lacks reliability. 


Keywords: HTTP Adaptive Streaming · Video streaming · Context awareness ·
Quality of Experience · Stalling probability


1 Introduction 


1.1 Motivation 


The rising number of smart phone subscriptions, which are expected to reach 
9.2 billion by 2020, combined with the explosive demand for mobile video, which 
is expected to grow around 13 times by 2019, accounting for 50% of all global 
mobile data traffic, will result in a ten-fold increase of mobile data traffic by 
2020 [1]. This explosive demand for mobile video is fueled by the ever-increasing 
number of video-capable devices and the integration of multimedia content in 
popular mobile applications, e.g. Facebook and Instagram. Furthermore, the use 
of video-capable devices, which range from devices with high resolution screens 
to interactive head mounted displays, requires a further increase of the band- 
width, so that on-demand video playback can be supported and differentiated 
expectations raised by the end video consumers can be satisfied.

Following this trend for video streaming, mobile network operators and service
providers focus on the Quality of Experience (QoE) of their customers, controlling
network- or application-level parameters, respectively. In parallel, from
the user’s side, a greater QoE enhancement can be achieved if both network-
and application-level information are utilized (cross-layer approaches). On top
of that, the greatest gains are possible if “context information” is also used by any
of these parties, complementary to the usually available Key Performance and
Key Quality Indicators (KPIs and KQIs), to which the service/network providers
already have access. As a general conclusion, (ideally) cross-party, cross-layer,
and multi-context information is required towards devising mechanisms that will
have the greatest impact on the overall user QoE.

In parallel, since most of the consumed video of a mobile data network is 
delivered through server-controlled traditional HTTP video streaming, the abil- 
ity of such monolithic HTTP video streaming to support a fully personalized 
video playback experience at the end-user is questioned. To this end, this tradi- 
tional technique is gradually being replaced by client-controlled video streaming 
exploiting HTTP Adaptive Streaming (HAS). HAS can split a video file into 
short segments of a few seconds each, with different quality levels and multiple 
encoding rates, allowing a better handling of the video streaming process, e.g. 
by adapting the quality level of future video segments. HAS is a key enabler 
towards a fully personalized video playback experience to the user, as it enables 
the terminal to adapt the video quality based on the end device capabilities, the 
expected video quality level, the current network status, the content server load, 
and the device remaining battery, among others. 
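
To make the client-side adaptation concrete, the sketch below shows one simple throughput-based rule for choosing the encoding rate of the next segment. It is a generic illustration and not the strategy studied in this chapter; the function name, the bitrate ladder and the safety factor are assumptions.

```python
def select_bitrate(available_kbps, throughput_kbps, safety=0.8):
    """Pick the highest encoding rate that fits within a share of the throughput.

    available_kbps  -- encoding rates offered for the segment (any order)
    throughput_kbps -- client's current throughput estimate
    safety          -- fraction of the estimate the client is willing to use
    Falls back to the lowest rate when none fits.
    """
    budget = throughput_kbps * safety
    fitting = [rate for rate in sorted(available_kbps) if rate <= budget]
    return fitting[-1] if fitting else min(available_kbps)


# Hypothetical bitrate ladder for one video.
ladder = [400, 1000, 2500, 5000]
print(select_bitrate(ladder, throughput_kbps=3500))
print(select_bitrate(ladder, throughput_kbps=300))
```

A context-aware client could extend such a rule with further inputs, for example the remaining battery or an expected service outage, which is the direction taken in the rest of this chapter.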

In this chapter, our objective is to investigate how context awareness in
mobile networks can help not only to understand but also to enhance the quality
experienced by the user during HAS sessions. We study a scenario where users travelling
in a vehicle experience bad or no service at all (i.e. a service outage). In this
or similar types of scenarios, the opportunity emerges to propose novel, preemptive
strategies to overcome such imminent problems, for instance by proposing
proactive adaptive streaming or buffering techniques for video streaming services.
This scenario has been modelled, optimized and investigated by means
of simulation. Before presenting the problem under study, we first identify the
need for, and the changes required in, a move from QoE-oriented to context-aware
network/application management.


1.2 From QoE-Awareness to Context-Awareness


QoE is defined as “the degree of delight or annoyance of the user of an appli- 
cation or service” [2], and as such, it is an inherently subjective indication of 
quality. Consequently, a significant amount of research efforts has been devoted 
to the measurement of this subjective QoE. The goal of these efforts is to find 
objective models that can reliably estimate the quality perceived/experienced by 
the end-user. To this end, subjective experiments that involve human assessors 
are carefully designed, with the purpose of mapping the various quality influence
factors to QoE values. In [2], these influence factors are defined as “any
characteristic of a user, system, service, application, or context whose actual state or
setting may have influence on the Quality of Experience for the user”, and they 
are basically classified into three distinct groups, namely, Human, System, and 
Context factors. Human influence factors include any psychophysical, cognitive, 
psychological or demographic factors of the person receiving a service, while 
system influence factors concern technical parameters related to the network, 
application and device characteristics and parameters. Finally, context relates 
to any spatio-temporal, social, economic and task-related factors. 

The awareness of QoE in a network is a valuable knowledge not only per 
se (namely for network monitoring and benchmarking purposes) but also as a 
useful input for managing a network in an effective and efficient way. The “QoE- 
centric management” of a network can be performed as a closed loop procedure, 
which consists of three distinguishable steps: 


QoE Modelling: For the purposes of QoE modelling, key influence factors that
have an impact on the network’s quality need to be mapped to QoE values. To
this end, QoE models have to be used that try to accurately reflect/predict
a subjective QoE estimation.


QoE Monitoring: This step provides answers on how, where and when QoE- 
related input can be collected. It includes the description of realistic architectures 
in terms of building blocks, mechanisms, protocols and end-to-end signalling in 
the network. Also, this procedure relates to the way in which feedback concerning 
QoE measurements can be provided from end-user devices and any network 
nodes to the responsible QoE-decision making entities in the network. 


QoE Management and Control: This step includes all the possible QoE- 
driven mechanisms that can help the network operate in a more efficient and 
qualitative way. These mechanisms may include for instance power control, 
mobility management, resource management and scheduling, routing, network 
configuration, etc. All these procedures can be managed based on QoE instead 
of traditional Quality of Service (QoS) criteria and their impact can be assessed 
based on the QoE they achieve. Multiple variants of the three previous steps or 
building blocks can be found in the literature, such as [3,4]. 

“Context” may refer to “any information that can be used to characterize the 
situation of an entity” [5]. In this way, context awareness can facilitate a tran- 
sition from packet-level decisions to “scenario-level” decisions: Indeed, deciding 
on a per-scenario rather than on a per-packet level may ensure not only a higher 
user QoE but also the avoidance of over-provisioning in the network. This huge 
potential has been recently identified in academia and as a result, research works 
on context awareness and context-aware network control mechanisms are con- 
stantly emerging in the literature. In [6], a context aware handover management 
scheme for proper load distribution in an IEEE 802.11 network is proposed. In 
[7], the impact of social context on compressed video QoE is investigated, while 
in [8] a novel decision-theoretic approach for QoE modelling, measurement, and 
prediction is presented, to name a few characteristic examples. 

If we now revisit the three-step QoE control loop described earlier and also 
consider context awareness, the loop is enriched as follows: 


Context Modelling: Based on the earlier discussion about the QoE modelling 
procedure, we may observe that the System as well as the Human influence 
factors are directly or indirectly taken into account in the subjective experiments’ 
methodologies, e.g. [9]. Consequently, the impact of technical- and human-level 
characteristics is tightly integrated into the derived QoE models. Nevertheless, 
the Context influence factors are mostly missing in these methodologies, or are 
not clearly captured. This happens due to the fact that the QoE evaluations are 
usually performed in controlled environments, not allowing for diversity in the 
context of use. Besides, context factors are challenging to control, especially in a 
lab setting, and new subjective experiment types would have to be designed. As 
a consequence, the mapping of context influence factors to QoE is absent from 
most QoE models that appear both in the literature and in standardization 
bodies. Since these context factors are (often) neglected, novel context-aware 
QoE models need to be devised that can accurately measure and predict QoE 
under a specific context of use. The context factors could either be integrated 
directly into a QoE model, or be used as a tuning factor of an otherwise 
stand-alone QoE model. 


Context Monitoring: On top of QoE monitoring, context monitoring proce- 
dures could (and should) be implemented in the network. These procedures will 
require different input information from the ones used by traditional QoS/QoE 
monitoring techniques. The acquired context information may be used for 
enhancing the QoE of the users or for the prediction of imminent problems, such 
as bottlenecks, and may range from spatio-temporal to social, economic and 
task-related factors. Some of the possible context information that may be mon- 
itored in a network is the following (to give a few examples): the current infras- 
tructure, which is more or less static (access points, base stations, neighbouring 
cells, etc.), the specific user’s surrounding environment (location awareness, out- 
doors/indoors environment, terrain characteristics, presence of blind spots such 
as areas of low coverage or limited capacity, proximity to other devices, etc.), 
the time of day, the current and predicted/expected future network load, the 
current mobility level or even the predicted mobility pattern of users in a cell 
(e.g. a repeated pattern), the device capabilities or state (e.g. processing power, 
battery level, storage level, etc.), the user task (e.g. urgent or leisure activity), 
as well as application awareness (e.g. foreground or background processes), and 
social awareness of the end-users, among others. Moreover, charging and pricing 
can also be included in the general context profile of a communication scenario. 
It needs to be noted here that context awareness does not necessarily rely on 
predicting the future (e.g. future traffic demand) but also on solid knowledge 
that is or can be available (e.g. time of day, outage location, etc.). 


Context-Aware Management and Control: Three possibilities emerge in a 
context-aware network. First, the network can take more sophisticated control 
decisions that are also influenced by context-awareness, such as for instance, 
a decision to relax the handover requirements for a user in a fast-moving vehicle 
or a decision to connect a device with low battery to a WiFi access point. 
Second, the network can actualize control decisions exploiting the current con- 
text. For instance, it can exploit information about flash crowd formation to 
drive an effective Content Distribution Network (CDN) load balancing strategy 
[10] or, more generally, to take control decisions proactively based on context 
information about the near future. Finally, context-awareness can help to take 
decisions with the objective to increase the network efficiency as measured in 
spectrum, energy, processing resources, etc., and consequently to reduce opera- 
tional expenses. For instance, context information could allow for a more mean- 
ingful distribution of the network resources among competing flows that refer to 
different communication scenarios. 

This book chapter handles a characteristic use case of context-aware man- 
agement, to showcase its potential. More specifically, we study a scenario where 
“context awareness” refers to awareness of the location and duration of a forth- 
coming outage, namely of a restricted area of very low or zero bandwidth (e.g. 
limited coverage due to physical obstacles or limited capacity due to high net- 
work congestion). Based on this knowledge, we devise a proactive HAS strategy 
that will enhance the viewing experience of a user travelling inside a vehicle 
towards this area. 

Related work involves HAS strategies that use geo-location information 
([11,12]) and prompt users to send measurements regarding their data rate, so 
that an overall map of bandwidth availability can be created for a certain area. 
Other HAS techniques rely on prediction, rather than context-awareness. For 
instance, [13] describes a HAS method where higher quality segment requests 
are a posteriori replaced with lower ones, as soon as a zero-bandwidth spatio- 
temporal event is identified. Moreover, similarly to our approach, [14] proposes 
an anticipatory HAS strategy, which requires prediction of the channel state 
in terms of Received Signal Strength (RSS) and proactively adjusts the user’s 
buffer. An optimization problem is formulated that minimizes the required num- 
ber of spectrum resources, while it ensures the user buffer is better prepared for 
an imminent coverage loss. The authors even conducted a demo of this approach 
in [15] that serves as a proof of concept. Our approach differs in that we rely 
on longer-term context awareness rather than imminent channel prediction, and 
that instead of manipulating the user buffer size, we proactively adapt the 
video quality selection. Finally, [16] combines RSS information with 
localization sensors from smartphones that reveal the user’s coverage state 
and help achieve a smoother and more stable HAS policy, called Indoors-Outdoors 
aware Buffer Based Adaptation (IOBBA). 


2 System Analysis 


2.1 System Model 


The environment under study is a mobile cellular network. We consider a cell, 
where one base station is offering connectivity to multiple users, residing inside 
the cell. Here we focus on users of TCP-based video streaming services (e.g. 
YouTube videos) and therefore consider only the Downlink (DL). 

Due to the challenges introduced by the access part of the network, namely 
due to pathloss, shadowing, fading and penetration losses, as well as due to 
the mobility of the users within this cell, the channel strength and quality may 
fluctuate significantly from user to user, from location to location, and from time 
to time. The existence of an outage inside a cell poses a high risk for the viewing 
experience of mobile video streaming users, since it might lead to a stalling event. 

In the context of this scenario and with the assistance of Fig. 1, we can 
mathematically represent the system model and problem statement. Assume 
that a video streaming user is inside a vehicle (such as a bus or train), which is 
travelling in a particular direction at a specific speed. We assume that 
the position and the length of an upcoming outage are known in advance (due 
to context awareness). As a result, the remaining distance between the vehicle 
and the outage’s starting point is also available at the client side. This distance 
corresponds to a travelling time of $t_{\text{dist}}$, namely the time required until the user 
enters the outage region. Let $b$ be the current buffer status of this user’s HAS 
application. Then, during $t_{\text{dist}}$, this buffer level will be boosted by $b_+$ but also 
reduced by $b_-$. Similarly, throughout the outage duration, the buffer will be 
boosted by $b_{\text{outage}+}$ but also reduced by $b_{\text{outage}-}$. When the user enters (exits) 
the outage region, the application’s buffer level will be $b_{\text{outage-in}}$ ($b_{\text{outage-out}}$), 
respectively, and it will hold that: 


$$b_{\text{outage-in}} = b + b_+ - b_- \tag{1}$$


$$b_{\text{outage-out}} = b_{\text{outage-in}} - b_{\text{outage}-} \tag{2}$$

because $b_{\text{outage}+}$ is assumed equal to zero, namely there is negligible or no 
connection to the base station inside the outage region. Then, we can express the 
objective of the proposed HAS strategy as the following: 


$$b_{\text{outage-out}} \geq b_{\text{thres}} \tag{3}$$


which means that when the vehicle is exiting the outage region, the buffer status 
of the HAS application should be at least equal to the minimum buffer threshold, 
$b_{\text{thres}}$, which ensures that the video playout continues uninterrupted. Note that 
a stalling always occurs when $b < b_{\text{thres}}$. Substituting Eqs. (1) and (2) into 
Eq. (3), the last condition can be re-written as: 


$$b_+ \geq b_{\text{thres}} + b_- + b_{\text{outage}-} - b \tag{4}$$


This condition answers the question of how much the buffer of the HAS 
application should be proactively filled during $t_{\text{dist}}$ (namely from the time of 
reference up to the outage starting point), so that no stalling occurs despite 
the imminent connection disruption. Note that all the parameters on the 
right-hand side are known to the client or can be easily estimated ($b_{\text{thres}}$ is 
fixed, $b$ is directly known to the client application, while $b_-$ and 
$b_{\text{outage}-}$ can be estimated). It should be stressed that all previous 
buffer-related variables may be expressed either in seconds, i.e. buffer 
playtime, or in bytes, i.e. buffer size. 

Based on the previous system model we can estimate $b_+$, namely the 
required buffer boost (in bytes or in seconds) to avoid any stalling during the 
outage duration. This estimate can then be translated into a required 
“advance time”, $t_{\text{adv}}$, by which the travelling user needs to be notified about 
the existence of the outage (namely, its starting position and duration), in order 
to run the proactive HAS strategy proposed here. We assume that the users 
switch from a standard HAS strategy to the adapted one exactly at $t_{\text{adv}}$. We 
can express $b_+$ as a function of $t_{\text{adv}}$, as follows: 


$$b_+ = r \cdot t_{\text{adv}} \tag{5}$$


where $r$ (bytes per second) is the data rate estimated by the client’s application. 
Namely, $r$ is the user’s prediction of the available network bandwidth, as 
estimated by the HAS strategy. Therefore, the minimum required advance time in 
order to avoid any stalling would be: 


$$t_{\text{adv}} \geq \frac{b_{\text{thres}} + b_- + b_{\text{outage}-} - b}{r} \tag{6}$$


To avoid a stalling, $t_{\text{adv}}$ should be less than the remaining $t_{\text{dist}}$, namely the user 
should be notified early enough to react. 
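
As a rough illustration of Eqs. (4)-(6), the following Python sketch computes the required buffer boost and the minimum advance time, and checks whether the notification arrives early enough. The function names, the clamping of the boost at zero and the numbers in the usage lines are illustrative assumptions and not part of the original model; all buffer quantities are taken in bytes so that Eq. (5) applies directly.

```python
def required_buffer_boost(b_thres, b_minus, b_outage_minus, b):
    """Eq. (4): minimum boost b_+ (bytes) needed before the outage starts.

    Clamping at zero is an assumption of this sketch: a buffer that is
    already full enough needs no extra boost.
    """
    return max(0.0, b_thres + b_minus + b_outage_minus - b)


def min_advance_time(b_thres, b_minus, b_outage_minus, b, r):
    """Eq. (6): minimum advance time (s) for an estimated rate r (bytes/s)."""
    if r <= 0:
        raise ValueError("estimated rate r must be positive")
    return required_buffer_boost(b_thres, b_minus, b_outage_minus, b) / r


def notified_early_enough(t_adv, t_dist):
    """The text requires t_adv to be less than the remaining t_dist."""
    return t_adv < t_dist


# Hypothetical numbers, chosen only to exercise the formulas.
t_adv = min_advance_time(b_thres=1_000_000, b_minus=3_000_000,
                         b_outage_minus=5_000_000, b=2_000_000, r=500_000)
feasible = notified_early_enough(t_adv, t_dist=20.0)
```

With these made-up values the boost is 7,000,000 bytes, so at 500,000 bytes per second the client needs 14 s of advance time, which fits into the 20 s that remain before the outage.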


Fig. 1. Problem description using buffer status information. 


2.2 Optimization Problem 


The goal of this section is to formulate a problem that achieves optimal segment 
selection with respect to three different optimization objectives, described next. 
The optimization problem is formulated using the following notation¹: 


— $\tau$ is the length of each segment in seconds. 

— $T_0$ is the initial delay of the video. 

— $D_i$ is the deadline of each segment $i$, meaning that this segment needs to be 
completely downloaded up to this point. 


Then: 

$$D_i = T_0 + i\tau, \quad \forall i = 1, \dots, n \tag{7}$$


Also: 


— $n$ is the total number of segments that comprise the video. 

— $r_{\max}$ is the maximum number of available layers/representations. 

— $x_{ij}$ represents segment $i$ of layer $j$. 

— $w_{ij}$ is the weighting factor for the QoE of segment $i$ of layer $j$. Here, we use 
the quality layer value as weighting factor, $w_{ij} \in \{1, 2, 3\}$. 

— $S_{ij}$ is the size of segment $i$ of layer $j$ (e.g. in bytes). 

— $b(t)$ is the total data downloaded until the point in time $t$. We assume perfect 
knowledge of $b(t)$. 

— $\alpha$ is the weight for the impact of the quality layer and $\beta$ for the impact of the 
switches ($\alpha + \beta = 1$, $\alpha > 0$, $\beta > 0$). 


¹ [17] is used as a reference. 


QoE studies on HAS (e.g. [18,19]) have revealed that the major quality influence 
factors are, in order of significance: (a) the layers selected, and especially the 
time spent on the highest layer, and (b) the altitude, i.e. the difference between 
subsequent quality levels (the smaller the better). Other factors with less 
significance are the number of quality switches, the recency time and the last 
quality level. Taking these findings into account, we focus on three different 
optimization objectives, which aim to maximize the positive impact of higher 
layer selection while deducting the negative impact of quality switches and 
altitude. They are formulated as follows: 


— Optimal strategy “W” accounts only for the impact of the quality layers, 
trying to maximize their value, so that the highest layer will be favored over 
the intermediate layer, which will be preferred over the lowest layer. 

— Optimal strategy “W+S” additionally accounts for the number of switches, 
trying to minimize their occurrence. 

— Optimal strategy “W+S+A” additionally accounts for the altitude effect, 
trying to minimize the distance between subsequent layers, thus preferring 
direct switches e.g. from layer 1 to layer 2 rather than from layer 1 to layer 3. 


This leads us to the three different formulations of the optimization problem for 
one user: 

— W: Maximize the quality layer values: 


$$\text{maximize} \quad \sum_{i=1}^{n} \sum_{j=1}^{r_{\max}} \alpha\, w_{ij}\, x_{ij} \tag{8}$$


— W+S: Maximize the quality layer values minus the number of switches: 


$$\text{maximize} \quad \sum_{i=1}^{n} \sum_{j=1}^{r_{\max}} \alpha\, w_{ij}\, x_{ij} - \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j=1}^{r_{\max}} \beta\, (x_{ij} - x_{i+1,j})^2 \tag{9}$$


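— W+S+A: Maximize the quality layer values minus the number of switches 
and the altitude difference: 


$$\text{maximize} \quad \sum_{i=1}^{n} \sum_{j=1}^{r_{\max}} w_{ij}\, x_{ij} - \sum_{i=1}^{n-1} \sum_{j=1}^{r_{\max}} \left[ (x_{ij} - x_{i+1,j})^2 + \frac{\cdots}{|p - j|} \right] \tag{10}$$


where 
$p = \{1, \dots, r_{\max}\} - \{j\}$ 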


Although Eq. (10) looks complicated, the terms in its last parenthesis represent 
the preference for switches between “neighbor” layers (i.e. after a layer 1 
selection, switches to layer p = 2 will be preferred; after a layer 2 selection, 
switches to either layer p = 1 or p = 3 will be preferred; while after a layer 3 
selection, switches to layer p = 2 will be preferred). 
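
To see why the second sum of Eq. (9) counts quality switches, note that if every segment is assigned exactly one layer, a switch between consecutive segments changes exactly two binary entries (one from 1 to 0 and one from 0 to 1), so half of the sum of squared differences equals the number of switches. The short Python sketch below checks this on a hypothetical layer sequence; the helper names and the sequence are illustrative assumptions.

```python
def one_hot(layers, r_max):
    """Encode a layer sequence as binary rows: x[i][j-1] = 1 iff segment i uses layer j."""
    return [[1 if layer == j else 0 for j in range(1, r_max + 1)]
            for layer in layers]


def switch_penalty(x, beta=1.0):
    """Second term of Eq. (9): (1/2) * sum_i sum_j beta * (x_ij - x_{i+1,j})^2."""
    total = 0
    for current, following in zip(x, x[1:]):
        total += sum((a - b) ** 2 for a, b in zip(current, following))
    return beta * total / 2


# Hypothetical sequence with three switches: 1->2, 2->3 and 3->1.
layers = [1, 1, 2, 3, 3, 1]
penalty = switch_penalty(one_hot(layers, r_max=3))
```

Note that this term treats the jump from layer 3 to layer 1 like any other switch, which is the motivation for the additional altitude term in the W+S+A objective.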

All above optimization objectives are subject to the following constraints: 


$$x_{ij} \in \{0, 1\} \tag{11}$$

$$\sum_{j=1}^{r_{\max}} x_{ij} = 1, \quad \forall i = 1, \dots, n \tag{12}$$

$$\sum_{i=1}^{k} \sum_{j=1}^{r_{\max}} S_{ij}\, x_{ij} \leq b(D_k), \quad \forall k = 1, \dots, n \tag{13}$$


The three constraints in this problem are interpreted as follows: $x_{ij}$ is a 
binary value (Eq. (11)), meaning that a segment is either downloaded or not; each 
segment has to be downloaded in exactly one layer (Eq. (12)); and all segments 
need to have been downloaded before their deadline, so that no stalling occurs 
(Eq. (13)). 
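
For very small instances, the W and W+S problems can be explored by exhaustive search, which makes the role of the deadline constraint easy to inspect. The Python sketch below is only such an illustration: it is not the solution method used in this chapter, it assumes $w_{ij} = j$, and all function names and numbers are hypothetical.

```python
from itertools import product


def deadlines(t0, tau, n):
    """Eq. (7): D_i = T_0 + i * tau for i = 1..n."""
    return [t0 + i * tau for i in range(1, n + 1)]


def is_feasible(choice, sizes, budget):
    """Constraint (13): bytes of segments 1..k must not exceed b(D_k).

    choice[i] is the layer (1-based) of segment i, sizes[i][j-1] is S_ij and
    budget[k] is b(D_k). Choosing one layer per segment satisfies (11)-(12).
    """
    total = 0
    for k, layer in enumerate(choice):
        total += sizes[k][layer - 1]
        if total > budget[k]:
            return False
    return True


def objective(choice, alpha, beta):
    """Eq. (9) with w_ij = j; setting beta = 0 reduces it to Eq. (8)."""
    quality = sum(alpha * layer for layer in choice)
    switches = sum(1 for a, b in zip(choice, choice[1:]) if a != b)
    return quality - beta * switches


def best_selection(sizes, budget, r_max, alpha, beta):
    """Exhaustive search over all r_max**n layer sequences (tiny n only)."""
    best, best_value = None, float("-inf")
    for choice in product(range(1, r_max + 1), repeat=len(sizes)):
        if is_feasible(choice, sizes, budget):
            value = objective(choice, alpha, beta)
            if value > best_value:
                best, best_value = choice, value
    return best, best_value


# Toy instance (all numbers hypothetical): 4 segments, 3 layers and a
# constant download rate of 150 bytes/s, so that b(t) = 150 * t.
sizes = [[100, 200, 400]] * 4
budget = [150 * d for d in deadlines(t0=2, tau=2, n=4)]
w_choice, w_value = best_selection(sizes, budget, r_max=3, alpha=1.0, beta=0.0)
ws_choice, ws_value = best_selection(sizes, budget, r_max=3, alpha=0.5, beta=0.5)
```

In this toy instance, requesting layer 3 for all four segments would need 1600 bytes by the last deadline while only 1500 bytes can be downloaded, so at least one segment has to be fetched at a lower layer.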


2.3 HAS-Based Strategy 


The proposed strategy needs to avoid stallings during the outage, which are 
highly likely to occur due to the very low network coverage. 
The main idea is to proactively lower the requested quality level of 
the next segments a priori, i.e. before entering the outage area. As a consequence, 
the buffer at the user side when entering the outage region will be fuller than it 
would have been without such a scheme (see Fig. 2)². 


Fig. 2. Adaptive video streaming scenario with and without context awareness. 

² This figure is adapted from [20]. 


As a result of this strategy, the user viewing experience will be less affected, 
not only because the video will continue to play without a stalling for a longer 
period of time (or, depending on the outage duration, will never stall), 
but also because the quality level will be gradually decreased and thus the user 
will be better acquainted with lower quality levels. Such progressive quality 
degradations would be preferred over sudden and unexpected quality 
degradations, especially if the quality level is already very high (cf. the IQX 
hypothesis [21]). Consequently, the objective of the proposed strategy is to 
compute the optimal context-based quality level selection that ensures the best 
QoE while avoiding any stallings. 

The HAS strategy is based on the estimation of the required buffer boost $b_+$, 
as described in Sect. 2.1. The expected downlink rate (network bandwidth 
prediction) is assumed equal to the segment rate. 
The segment rate (in bytes per second) is estimated over a sliding 
window of the past $k$ downloaded segments as follows: 


$$r = (1 - w) \times \frac{\text{Size of last } (k-1) \text{ segments}}{\text{Time to download } (k-1) \text{ segments}} + w \times \frac{\text{Size of segment } k}{\text{Time to download segment } k} \tag{14}$$

where $w$ is the weight (importance) given to the latest downloaded segment. 
Based on this rate estimation, the expected bytes that can be downloaded until 
the user enters the outage region are: 


$$b_{+,\text{expected}} = r \cdot t_{\text{adv}} \quad \text{(in bytes)} \tag{15}$$

while the minimum required buffer playtime to exit the outage region and avoid 
a stalling is: 


$$b_+ = b_{\text{thres}} + b_- + b_{\text{outage}-} - b \quad \text{(in seconds)} \tag{16}$$

Therefore, the required video rate, i.e. the bytes per second of playtime, is: 


$$\text{required video rate} = \frac{b_{+,\text{expected}}}{b_+} \quad \text{(in bytes per second)} \tag{17}$$


Note that the longer the outage duration, the larger $b_+$ and thus the lower 
the required video rate (lower layer selection). Based on the required video rate 
estimation, the HAS strategy will request the highest possible representation $j$ 
that fulfills this condition: 


$$\frac{S_{ij}}{\tau} \leq \text{required video rate} \tag{18}$$

Namely, the layer $j$ that will be requested will be the highest one that yields a 
video bit rate less than or equal to this estimation. The “required video rate” 
estimation may be updated each time in order to account for the most recently 
achieved data rate $r$. Alternatively, an average value may be calculated in the 
beginning (at $t_{\text{adv}}$) and assumed valid until entering the outage region. If the 
actual available data rate for this user is lower than the user’s own rate 
estimation, $r$, there is, however, a higher risk of stalling. We assume that 
the player requests the lowest layer when initialized. 
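
The per-segment decision described by Eqs. (14)-(18) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, layer sizes are assumed to grow with the layer index, the lowest layer is used as a fallback when no layer satisfies Eq. (18), and a non-positive $b_+$ is interpreted as "no rate limit needed".

```python
def estimate_rate(sizes, times, w):
    """Eq. (14): weighted rate over the last k segments (newest last).

    sizes are in bytes, times in seconds, w is the weight of the newest segment.
    """
    if len(sizes) < 2 or len(sizes) != len(times):
        raise ValueError("need at least two segments with matching times")
    older = sum(sizes[:-1]) / sum(times[:-1])
    latest = sizes[-1] / times[-1]
    return (1 - w) * older + w * latest


def required_video_rate(r, t_adv, b_plus_seconds):
    """Eqs. (15) and (17): expected bytes divided by the required playtime."""
    if b_plus_seconds <= 0:
        return float("inf")  # assumption: no boost needed, so no rate limit
    expected_bytes = r * t_adv  # Eq. (15)
    return expected_bytes / b_plus_seconds


def select_layer(layer_sizes, tau, required_rate):
    """Eq. (18): highest layer j with S_ij / tau <= required rate.

    layer_sizes[j-1] is S_ij for the next segment; falling back to layer 1
    when no layer qualifies is an assumption of this sketch.
    """
    chosen = 1
    for j, size in enumerate(layer_sizes, start=1):
        if size / tau <= required_rate:
            chosen = j
    return chosen


# Hypothetical values, chosen only to exercise the formulas.
r = estimate_rate(sizes=[1000, 1000, 2000], times=[1.0, 1.0, 1.0], w=0.5)
rate_limit = required_video_rate(r, t_adv=20.0, b_plus_seconds=30.0)
layer = select_layer(layer_sizes=[1000, 2000, 4000], tau=2.0, required_rate=rate_limit)
```

Calling `estimate_rate` again after every download corresponds to the "updated each time" variant mentioned above, whereas computing `rate_limit` once at $t_{\text{adv}}$ corresponds to the averaged variant.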


2.4 QoE Models 


The QoE models that are used in this work are the following: 


— A QoE model for HAS, where no stallings are assumed. This model can be 
found in [17] and it can be described by the following formula: 


$$\text{QoE} = 0.003 \cdot e^{0.064\, t} + 2.498 \tag{19}$$


where t is the percentage of the time that the video was being played out at 
the highest layer (here layer 3). 

— A QoE model for TCP-based video streaming, if stallings occur. This model 
can be found in [22] and it is described as follows: 


$$\text{QoE} = 3.5 \cdot \exp(-(0.15 \cdot L + 0.19) \cdot N) + 1.5 \tag{20}$$


where $N$ is the number of stalling events and $L$ is the stalling length. 


For the purposes of this scenario we combine the two aforementioned models, so 
that in case that no stalling has occurred, the former QoE model is used, while 
during and after a stalling event, we use the latter. 
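
The combination of the two models can be written down directly. The Python sketch below is a hedged illustration: the function names are hypothetical, $t$ is taken as a percentage between 0 and 100, and the stalling length $L$ is assumed to be given in seconds.

```python
import math


def qoe_no_stalling(t_percent):
    """Eq. (19): QoE as a function of the percentage of time on the highest layer."""
    return 0.003 * math.exp(0.064 * t_percent) + 2.498


def qoe_stalling(n_stalls, stall_length):
    """Eq. (20): QoE for N stalling events of length L (assumed in seconds)."""
    return 3.5 * math.exp(-(0.15 * stall_length + 0.19) * n_stalls) + 1.5


def combined_qoe(t_percent, n_stalls, stall_length):
    """Use Eq. (19) while no stalling has occurred, Eq. (20) afterwards."""
    if n_stalls == 0:
        return qoe_no_stalling(t_percent)
    return qoe_stalling(n_stalls, stall_length)


# Hypothetical inputs: all of the time on layer 3 and no stalling,
# versus a single long stalling event.
best_case = combined_qoe(t_percent=100, n_stalls=0, stall_length=0.0)
after_stall = combined_qoe(t_percent=100, n_stalls=1, stall_length=80.0)
```

Because the case distinction depends only on whether a stalling has occurred, the same function can be evaluated at every time instant with the values accumulated so far.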


2.5 Realization in the Network 


In this section, we provide some insights regarding the realization of the proposed 
scheme in a real network. Specifically, the information required for this 
framework to work either is already available or can easily become available, namely: 


— The existence and duration of an imminent outage. We assume that “Big 
Data” collection by the mobile operators regarding the connectivity of their 
subscribers can ensure the availability of this information. 

— The user’s moving direction and speed. This can be obtained via GPS infor- 
mation (current location, speed and direction combined with a map). 

— The minimum advance time $t_{\text{adv}}$ or minimum advance distance $\ell_{\text{adv}}$ at which 
the user has to initiate the proactive HAS strategy. There are two options 
here: either the user knows about the outage a priori and therefore switches 
to the enhanced HAS mode at $t_{\text{adv}}$ without any network assistance, or the 
user is informed by the network at $t_{\text{adv}}$ about the existence, starting point 
and length of the outage and then switches to the enhanced HAS mode. In the 
first case, the user runs an internal algorithm to estimate $t_{\text{adv}}$. 

— Standard information required for the operation of HAS, namely video seg- 
ment availability, network bandwidth estimation, and current buffer state. 


As far as the need for “Big Data” mentioned before is concerned, this may 
take two forms: either the data are collected at the device itself, because 
the user has the same travel profile every day and therefore learns about any 
coverage problems on the way, or the data are collected at a central network 
point (e.g. at a base station or a server) through measurements reported by any 
devices passing through that area. Actually, in Long Term Evolution (LTE) networks, 
such measurements are already available via “Channel Quality Indicators” (CQIs). 
CQIs report to the LTE base station (eNB, evolved NodeB) the quality 
of the received signals (SINR, Signal to Interference plus Noise Ratio) using 
values between 1 (worst) and 15 (best). Currently, CQIs are used only for 
real-time decisions such as scheduling; however, we may envision that CQIs could 
be collected by an eNB on a longer-term time scale (days or weeks) and be 
used to create a “coverage profile” of the cell. Based on such past 
information, proactive measures could be taken at a cell for users travelling 
towards problematic areas (e.g. a physical tunnel ahead). 
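
One conceivable way to turn long-term CQI reports into such a “coverage profile” is sketched below in Python. Everything in this sketch is an assumption made for illustration: the reports are taken as (position along a route in metres, CQI) pairs, positions are grouped into fixed-size bins, and a bin is flagged as a potential outage when its mean CQI falls to or below a hypothetical threshold. It does not reflect any standardized LTE procedure.

```python
from collections import defaultdict


def coverage_profile(reports, bin_size_m):
    """Average the reported CQI per position bin.

    reports is an iterable of (position_m, cqi) pairs; the result maps a
    bin index to the mean CQI observed in that bin.
    """
    sums = defaultdict(lambda: [0, 0])  # bin index -> [sum of CQI, count]
    for position_m, cqi in reports:
        index = int(position_m // bin_size_m)
        sums[index][0] += cqi
        sums[index][1] += 1
    return {index: total / count for index, (total, count) in sums.items()}


def outage_bins(profile, cqi_threshold):
    """Bins whose mean CQI is at or below the (hypothetical) threshold."""
    return sorted(index for index, mean in profile.items() if mean <= cqi_threshold)


# Hypothetical reports collected from devices passing through the cell.
reports = [(10, 12), (60, 11), (120, 2), (150, 1), (230, 13)]
profile = coverage_profile(reports, bin_size_m=100)
problem_areas = outage_bins(profile, cqi_threshold=3)
```

The start and length of the flagged bins, combined with the user's speed and direction, would then give the outage position and duration that the proactive HAS strategy takes as input.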


3 Evaluation Results 


For the purposes of evaluation we use Matlab simulation. The client’s buffer 
is simulated as a queuing model, where the “DOWNLOADED” segments are 
arrivals and the “PLAYED” segments are departures. To simulate the network 
traffic, we use real traces recorded from a network [23]. Moreover, to simulate 
congestion we use the parameter “bandwidth factor”³, which is a metric of the 
network congestion/traffic and takes values between 0 and 1 (the higher this 
factor the lower the congestion). 
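
The queuing view of the client buffer can be illustrated with a small discrete-time model. The Python sketch below is not the Matlab simulator used for the evaluation; it is a simplified illustration that assumes one-second time steps, segments of a fixed size in a single layer, and playback that starts (and resumes after a stalling) once the buffer holds a threshold amount of playtime. All names and numbers are hypothetical.

```python
def simulate_buffer(trace, bandwidth_factor, segment_bytes, tau, start_threshold):
    """Simulate the client buffer in 1 s steps.

    trace holds the available rate (bytes/s) per second and is scaled by the
    bandwidth factor. Completed segments are arrivals (+tau seconds of
    playtime); each second of playback is a departure. Returns the buffer
    playtime after every step and the number of stalled seconds.
    """
    buffered = 0.0   # playtime currently in the buffer (s)
    partial = 0.0    # bytes received for the segment being downloaded
    playing = False
    started = False  # distinguishes the initial delay from a stalling
    stalled = 0
    history = []
    for rate in trace:
        partial += rate * bandwidth_factor
        while partial >= segment_bytes:      # arrivals
            partial -= segment_bytes
            buffered += tau
        if not playing and buffered >= start_threshold:
            playing = True
            started = True
        if playing and buffered < 1.0:       # buffer ran dry
            playing = False
        if playing:
            buffered -= 1.0                  # departure
        elif started:
            stalled += 1
        history.append(buffered)
    return history, stalled


# Hypothetical trace: 10 s of connectivity followed by a 15 s outage.
trace = [1000] * 10 + [0] * 15
_, stalled_relaxed = simulate_buffer(trace, 1.0, segment_bytes=1000, tau=2, start_threshold=4)
_, stalled_congested = simulate_buffer(trace, 0.5, segment_bytes=1000, tau=2, start_threshold=4)
```

With this toy trace, halving the bandwidth factor leaves less playtime in the buffer when the outage begins, so the stalling lasts longer.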


³ The bandwidth factor concept is extracted from [17]. 

The parameters used in our simulation are presented in Table 1: 


Table 1. Simulation parameters. 


Parameter                                                    Value 
Segment duration                                             2 s 
Number of video segments                                     350 
Number of different representations (layers) per segment     3 
Buffer playout threshold (initial delay)                     10 segments 
Outage starting point                                        200 s after simulation start 
Outage duration                                              [0..400] s 
HAS policy sliding window                                    50 segments 
Bandwidth factor                                             0.8 (unless variable) 
Replications                                                 30, with different network traces each 


3.1 Proof of Concept 


The first evaluation study mainly serves as a proof of concept of the enhanced 
HAS logic. The goal is to demonstrate how a context-aware HAS policy can 
help overcome an otherwise inevitable buffer depletion and thus, an imminent 
stalling event. To demonstrate that, we plot four different metrics: (a) the client 
buffer size in bytes, (b) the client buffer size in seconds (i.e. buffer playtime), 
(c) the HAS layers selected for each played out segment, and finally (d) the QoE 
evolution in time for the travelling user. For the latter, we make the assumption 
that the QoE models presented in Sect. 2.4 hold also in a real-time scale, and 
that the QoE model for HAS holds for the tested scenario where three different 
layers are available per segment. Real-time estimation of the QoE for a partic- 
ular user means that QoE is estimated at every time instant t using as input 
accumulated information about the percentage of time that this user has already 
spent watching the video at layer 3 up to instant t, as long as no stalling has 
occurred yet, or information about the number N and duration L of stalling 
events since t = 0 up to instant t, as long as at least one stalling has occurred. 
As shown in Fig. 3, three different cases are considered, namely (a) the con- 
ventional case, where no context awareness about the outage event is available, 
and consequently, the standard HAS strategy is implemented, (b) the case where 
context awareness about the starting point and duration of the outage event is 
available, which leads to the selection of the adapted, proactive HAS strategy, 
and finally (c) the optimal case (W) described in Sect. 2.2. Examining Figs. 3a 
and b we can see that a stalling of around 80s is completely avoided when context 
awareness is deployed, or when optimal knowledge is assumed. The explanation 
behind the prevention of the stalling lies in Fig. 3c. In the “without context” 
case higher HAS layers are selected as compared to the “with context” case 
(which mainly selects layer 2), especially around the outage occurrence, which 
here starts and ends at 200 s and 400 s, respectively. Having downloaded lower HAS 
layers in the “with context” case, the buffer of the client is fuller in terms of 
playtime than it would have been if higher HAS layers had been downloaded 
instead. The impact on QoE for all cases is also presented in Fig. 3d, where we 
can see that even a single stalling event of a few seconds’ duration has a 
significantly deteriorating impact on the perceived QoE, as compared to the 
selection of lower HAS layers. QoE values per strategy follow the trend of layer 
selection: this is why the “context case” at some periods reveals higher QoE than 
the “optimal” case (the former requests more layer 3 segments before the outage). 

Fig. 3. Client behavior with context awareness, without context awareness, and optimal 
behavior (W). 

Comparing now the enhanced HAS strategy with the optimal strategy, we 
observe that the latter does a better job in selecting higher quality layers (espe- 
cially layer 2 segments) up to the point of the outage start. The reason is that 
the optimal strategy has full awareness of the future network conditions and 
thus, can take more informed decisions that lead to the highest layer selection 
with zero stalling risk. 


3.2 Required Advance Time Estimation (“Context Time”) 


Next, we study how the outage duration influences the required advance time, 
$t_{\text{adv}}$, and present the results in Fig. 4 (mean and standard deviation). We observe 
an intuitively expected trend, i.e. that the user needs to initiate the proactive 
HAS strategy earlier for longer outage durations (i.e. a higher $t_{\text{adv}}$ is required). 
In this way, the user has more time to buffer sufficient playtime. Moreover, 
the standard deviation follows the same trend, indicating higher uncertainty 
for longer outages. The required advance time strongly depends on the achieved 
data rate per user, which for the purposes of simulation is a result of the network 
traces and bandwidth factor. 

Fig. 4. The minimum required advance time ($t_{\text{adv}}$) to avoid a stalling event during an 
outage. 


3.3 Comparison of Different Strategies 


Next we perform a study with respect to the availability of bandwidth, in order 
to evaluate how HAS performs in bandwidth-challenging scenarios. Since we use 
real traces as input information about the data rates in the network, we can 
indirectly enforce a network congestion by multiplying the measured bandwidth 
with the aforementioned bandwidth factor. 

(a) With context awareness: Minimum required advance time ($t_{\text{adv}}$) to avoid a stalling 
event in light of an outage event of 150 s for various bandwidth factors. 

(b) Without context awareness: Stalling probability for various bandwidth factors. 


Fig. 5. Simulation results for various bandwidth factors with and without context 
awareness. 


The purpose of the first study with regard to the bandwidth factor is to 
investigate how it influences the minimum advance time $t_{\text{adv}}$ in the case of 
context awareness, and how it influences the stalling probability in the 
conventional context-unaware case. The results are presented in Fig. 5. As 
demonstrated in Fig. 5a, for very low data rates (e.g. a bandwidth factor of 0.2), 
the minimum required advance time gets higher, as the user would need a much 
greater time margin to proactively fill the buffer in light of the outage, because 
the network is heavily congested. Moreover, the uncertainty in this case is also 
very high, a conclusion that we have seen in the previous section as well. 
Conversely, the more relaxed the network conditions, the later the notification 
about the outage can arrive, and the required advance time practically drops to 
zero seconds (i.e., no notification is needed) when the network conditions are 
very relaxed (bandwidth factor = 1). Similar conclusions are drawn for the 
context-unaware case with regard to the stalling probabilities for different 
bandwidth factors: the lower this factor, the higher the stalling probability, 
as expected (Fig. 5b). 

Next, we compare the behaviour of the three different types of the optimal 
strategy (i.e. cases W/W+S/W+S+A, as described in Sect. 2.2) both with each 
other and with the context-aware strategy. In Figs. 6a-d, the percentage 
of time spent on each of the three layers as well as the resulting number of 
switches are presented per strategy. All four strategies follow a similar trend as 
bandwidth availability increases: more and more layer 3 segments are 
selected, while fewer and fewer layer 1 segments are selected. With respect to 
layer 2 segments, the behaviour differs: layer 2 selection increases when the 
bandwidth factor changes from 0.25 to 0.5, but decreases when it changes from 
0.5 to 1. Note that a bandwidth factor of 0.25 represents 
very high congestion and a bandwidth factor of 1 represents very low congestion. 

Another interesting observation is that strategy W+S+A “avoids” layer 2 
segments almost completely. The reason is that layer 2 in W+S+A 
is mostly used as a “transition step” to switch to layer 1 or layer 3, respecting 
the objective to keep the altitude between two sequential layers as low as possible. 
Equation (10) gives the same priority to staying at the same layer and to 
switching to a +1 or −1 layer. This is not necessarily the best action in terms 
of QoE, but there is no complete HAS QoE model yet that would allow building 
the perfect optimization function. However, the optimization goal of low altitude 
between successive layers holds. On the contrary, strategy W+S has a tendency to 
select many layer 2 segments, which is explained by its goal to minimize the 
switches and thus operate at a stable but safe level. We have also tested a “W+A” 
optimal strategy (not mentioned in Sect. 2.2), but this has been found to cause too 
many quality switches; therefore it was not considered for further investigation. 

It is important to note that no optimal strategy is considered “better” than 
the others; they all represent how different optimization objectives behave under 
varying bandwidth conditions. However, once a validated multi-parameter QoE 
model for HAS becomes available in the future, the optimization problem could 
be revisited. 

In terms of quality switches caused, which is another important QoE impairment
factor, the context-aware strategy and the optimal W strategy cause the
highest number of switches, since they do not take measures to prevent them
(see Fig. 6d). On the contrary, the optimal W+S and optimal W+S+A strategies
cause the fewest switches. Between the last two, W+S+A causes
more switches, as it gives equal priority to mitigating switches and keeping the
altitude of any switches at a low level.
Fig. 6. Simulation results for various bandwidth factors for the three optimal cases
W/W+S/W+S+A as well as the context-aware strategy.
3.4 The Impact of Unreliability of Context Information


In this section we study how unreliability in the context information influences
the probability of a stalling event. In other words, we study how likely
the proactive HAS strategy is to end in a stalling when accurate information
about the outage starting point is missing, or when this information cannot
be obtained in time.

For the purposes of this experiment, we assume that the buffer of the user 
is not limited, and therefore the user will continue to download as many bits 
as its connectivity to the base station allows. As a consequence, the starting 
point of the outage plays an important role, since the further away it is from the 
vehicle’s current location, the fuller the buffer of the client will be under normal 
circumstances up to that point. Thus, the stalling probability will also be lower.
Overall, this study evaluates to what extent an unexpected outage is mapped to 
a stalling probability. 
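
The reasoning above can be illustrated with a minimal sketch. It assumes a constant download throughput and a constant video bitrate before the outage, plus an unlimited buffer; the function names and the numeric values in the usage note are hypothetical and are not taken from the simulation setup of this chapter.

```python
def buffered_playtime(time_to_outage_s, throughput_kbps, bitrate_kbps):
    """Seconds of video held in an unlimited buffer when the outage starts.

    Assumption: throughput and bitrate are constant and throughput >= bitrate,
    so playback does not stall before the outage (a simplification).
    """
    downloaded_s = time_to_outage_s * throughput_kbps / bitrate_kbps
    played_s = time_to_outage_s  # one second of video is consumed per second
    return max(0.0, downloaded_s - played_s)


def stalls_during_outage(time_to_outage_s, outage_s, throughput_kbps, bitrate_kbps):
    """True if the buffer runs empty before the outage ends."""
    buffered_s = buffered_playtime(time_to_outage_s, throughput_kbps, bitrate_kbps)
    return buffered_s < outage_s
```

With a throughput of twice the bitrate, an outage of 20 s that begins after 10 s leads to a stalling in this sketch, whereas the same outage beginning after 60 s does not, which mirrors the trend that a more distant outage lowers the stalling probability.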

The results under this perspective are presented in Fig. 7. As expected, the
further away the outage, the lower the stalling probability. However, it might
be more meaningful to conduct the same study assuming a limited buffer size
for the client’s application, which is a more realistic assumption. In that case,
we would expect that the starting point of the outage would not play such a
crucial role, but the maximum size of the buffer would. Note that a normal
value for an upper threshold in the number of buffered segments would be 50
segments. However, this study still provides some insights into the impact of
the unpredictability of the outage starting point.
Fig. 7. The impact of the outage starting point on the stalling probability.


Next, we investigate what happens if the context information communicated
to the client is not 100% accurate or, similarly, if it is not
communicated early enough in advance (so it is accurate but arrives
with some delay). Specifically, we assume that the information about t_adv
deviates from its mean value, as estimated in Sect. 3.2. This mean value
is considered to represent a “0% deviation” in the following figures. From Figs. 8a
and b, which represent the stalling probability and stalling duration respectively,
we draw two main conclusions. Firstly, we confirm that the mean values of t_adv
are not enough to prevent a stalling, because the standard deviations
have not been taken into account. In fact, as presented in Sect. 3.2, the standard
deviations are higher for larger outage lengths and thus we observe higher stalling
probabilities for the 0% values (compare the three plots per figure).


A second important conclusion, which is the focus of this simulation
study, is that uncertainty in this context information can lead to
inevitable stallings. This is reflected both in stalling probabilities
and in stalling durations. This emphasizes the need for accurate and timely context
information, which also takes into account statistical metrics such as the standard
deviation.
Fig. 8. Stalling effects when t_adv deviates from its mean value.


4 Conclusions 


In this chapter, a novel proactive HAS strategy has been proposed and evaluated,
demonstrating significant benefits as compared to the current approaches
in terms of QoE and meaningful KPIs such as stalling probability. The proposed
strategy can successfully help prevent stallings at the client’s HAS application
during network coverage problems. The “cost to pay” is the collection and signalling
of context information, which could however be realistically implemented;
therefore its adoption in a real network should not be difficult.

Even though this work focused on outage conditions of zero bandwidth, we 
could easily extend this solution to a more general problem where bandwidth 
may be insufficient (but not zero). Similarly, the same problem could be adjusted 
for cases of an imminent service disruption such as a handover, where the afore- 
mentioned HAS strategy can help prevent stallings during the disruption period 
(i.e. the handover period). This may become possible by exploiting handover- 
hinting information, a priori. In this way, the user will be better prepared for a 
potential interruption in his viewing experience. 

As future work, it would also be interesting to study a scenario with more than
one mobile video streaming user using HAS, and to investigate how the decisions
of one user potentially affect the others. Stability and fairness issues, together
with QoE analysis, would be of great interest in this case.

Finally, as a general comment, we would like to point out that this work 
could be revisited once a standard QoE model for HAS becomes available. In 
that case, we could have the opportunity not only to produce a more accurate 
optimization problem, but also to enhance the proposed HAS strategy, focusing 
on the key factors that mostly influence the end-users’ QoE. 
Abstract. This chapter presents scalable conceptual and analytical performance
models of overall telecommunication systems, allowing the prediction of multiple 
Quality of Service (QoS) indicators as functions of the user- and network 
behavior. Two structures of the conceptual presentation are considered and an 
analytical method for converting the presentations, along with corresponding 
additive and multiplicative metrics, is proposed. A corresponding analytical 
model is elaborated, which allows the prediction of flow-, time-, and traffic char- 
acteristics of terminals and users, as well as the overall network performance. In 
accordance with recommendations of the International Telecommunications 
Union’s Telecommunication Standardization Sector (ITU-T), analytical expres- 
sions are proposed for predicting four QoS indicators. Differentiated QoS indi- 
cators for each subservice, as well as analytical expressions for their prediction, 
are proposed. Overall pie characteristics and their causal aggregations are 
proposed as causal-oriented QoS indicators. The results demonstrate the ability 
of the model to facilitate a more precise dynamic QoS management as well as to 
serve as a source for predicting some Quality of Experience (QoE) indicators. 


Keywords: Overall telecommunication system · Performance model ·
Overall causal QoS indicator · Dynamic QoS management ·
Telecommunication subservices · Differentiated QoS subservice indicator ·
QoS prediction · Human factors of QoS


1 Introduction 


The telecommunication service is the basis for the Information Service Networks. From
the very beginning, the Internet has been a packet-based communication
system without guarantees for the quality of the services, which are provided on a best-effort
basis. At the same time, with the evolution of hardware technologies, and services
and applications becoming more and more complex, the quality of service (QoS) has
become a hot topic and the term “Internet QoS” has spread widely. The question of
providing QoS guarantees in the Internet is still open (cf., for instance, the history of
the Internet Engineering Task Force (IETF) standards for Integrated Services (IntServ)
and Differentiated Services (DiffServ) as well as the Third Generation Partnership
Project (3GPP)/European Telecommunications Standards Institute (ETSI) Internet
Protocol (IP) Multimedia Subsystem (IMS) initiative).

The QoS has many aspects — QoS offered by the provider, QoS delivered (QoSD), QoS 
achieved by the provider, QoS experienced by the user/customer (QoSE or QoSP — QoS 
perceived) and others. “The understanding of QoSE is of basic importance for the opti- 
mization of the income and the resources of the service provider” [1]. A new attitude 
towards the QoS has become dominant — QoS and Quality of Experience (QoE) are 
considered as goods. The agreement is made according to the perceived quality — Expe- 
rience Level Agreement (ELA) [2]. This approach considerably increased interest in the 
perceived quality among researchers, providers, and users of telecommunication services. 

As a result of the intensive research, the definition of QoE has evolved, and at the moment
the QoE is perceived as the degree of satisfaction or irritation of the users of an application
or service, which results from the fulfillment of their expectations about the utility
and/or the satisfaction from the application or service in the context of the user’s personality
and current state [3, 4]. The QoS perceived by users depends not only on the
quality offered by the provider but also on the context of the services, including the 
techno-socio-economic environment, user’s context, and other factors. The importance 
of the teletraffic models, particularly of the overall QoS indicators, for QoE assessment 
is emphasized by Fiedler [5]. 

From among the many services provided by a telecommunication system, this 
chapter deals with flow-, time-, and traffic characteristics of the connection and commu- 
nication services. The other QoS characteristics of information transmission service are 
reflected partially and indirectly as a probability of the call attempt abandoning by users. 

The main objective of the authors of this chapter is the development of scalable 
performance models of overall telecommunication systems, as a part of Information 
Service Networks, including many of the observable system-dependent factors deter- 
mining the values of QoS indicators. 

These models may be used for multiple purposes but the aim of this chapter is to 
develop prediction models for some key QoS indicators’ values, as functions of the user 
behavior and technical characteristics of the overall telecommunication system. Such 
values may be useful for the network design, for the management of telecommunication 
systems’ QoS, and as a source for predicting some QoE indicators. 

The work presented in this chapter continues the development of the approach for 
the conceptual and analytical modeling of overall telecommunication systems (with QoS 
guarantees), presented in [6]. 

Firstly, in Sect. 2, a scalable conceptual model of an overall telecommunication 
system with QoS guarantees is presented. Two structures of the conceptual presentation 
are compared — the normalized structure and the pie structure. An analytical method for 
converting the presentations, along with corresponding additive and multiplicative 
metrics, is proposed. A qualitative extension of the conceptual model, in comparison 
with [7], is proposed. This includes two new service branches corresponding to the cases 
of ‘called party being busy with another call’ and ‘mailing a message’. This allows
analyzing telecommunication systems’ QoS indicators as a composition of QoS indi- 
cators of consecutive and parallel subservices. 

The developed model is based on: a Bernoulli–Poisson–Pascal (BPP) input flow;
repeated calls; a limited number of homogeneous terminals; 11 cases of losses of call
attempts (due to abandoning, interrupting, blocking, and unavailable service); and three
successful cases (normal interactive communication, communication after call holding,
and mailing). The calling (A) and called (B) terminals (and users) are considered separately,
but in interaction with each other. This allows formulation of QoS indicators separately
for A-, B-, and AB-terminals.

In Sect. 3, on the basis of the developed conceptual model, a corresponding analytical 
model is elaborated. User behavior parameters and technical characteristics of the tele- 
communication network serve as an input for the model. The model itself is intended 
for systems remaining in a stationary state. It is insensitive to the distributions of random 
variables and provides results in the form of mean values of the output parameters. The 
model is verified for the entire theoretical interval of network load. It allows the predic- 
tion of flow-, time-, and traffic characteristics of A-, B-, and AB-terminals (and users), 
as well as of the overall network performance. 

In accordance with recommendations of the International Telecommunications 
Union’s Telecommunication Standardization Sector (ITU-T), analytical expressions for 
the prediction of three QoS indicators are proposed: 


• Carried Switching Efficiency, for finding B-terminal (Subservice 1);

• B-Terminal Connection Efficiency, for connection to the B-terminal, which aggregates
the Carried Switching Efficiency;

• Overall Call Attempt Efficiency, for call attempts finishing with fully successful
communication, which aggregates the Carried Switching Efficiency, B-Terminal
Connection Efficiency, Finding B-User Subservice, and Communication Subservice.


Four differentiated QoS indicators for each subservice are proposed along with analytical 
expressions for their prediction: 


• Carried Switching Efficiency (Ecs), for finding B-terminal (Subservice 1) as per the
ITU-T recommendations;

• QoS specific indicator (Qb) of Connection to the B-terminal (Subservice 2);

• QoS specific indicator (Qu) of Finding B-user (Subservice 3);

• QoS specific indicator (Qc) of Communication (Subservice 4).


The four QoS specific indicators are independent. They are components (in multipli- 
cative metrics) of the ITU-T concordant Overall Call Attempt Efficiency indicator (Ec): 


Ec = Ecs Qb Qu Qc. 


The usage of the proposed QoS indicators of telecommunication subservices allows 
conducting a more specific QoS analysis and more adequate QoS management. 
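
As a small illustration of the multiplicative composition Ec = Ecs Qb Qu Qc, the following sketch multiplies four subservice indicators. The function name and the numeric values in the usage note are hypothetical; only the product form is taken from the text.

```python
def overall_call_attempt_efficiency(ecs, qb, qu, qc):
    """Return Ec as the product of the four subservice indicators.

    Each argument is expected to be a probability-like value in [0, 1].
    """
    indicators = {"Ecs": ecs, "Qb": qb, "Qu": qu, "Qc": qc}
    for name, value in indicators.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return ecs * qb * qu * qc
```

For instance, with the illustrative values Ecs = 0.9, Qb = 0.8, Qu = 0.5 and Qc = 1.0, the overall efficiency is about 0.36, and the factor 0.5 shows at a glance which subservice limits the result most.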

In Sect. 4, in accordance with the ITU-T recommendations, analytical expressions
for the prediction of the Overall Traffic Efficiency Indicator and other overall pie
parameters and their causal aggregations are proposed and illustrated numerically. The
overall pie characteristics and their causal aggregations could be considered as causal-oriented
QoS indicators. The results allow a more precise estimation of the dynamic
importance of each reason for call attempts ending and thus a more precise dynamic
targeting of the QoS management effort.

In the Conclusion, possible directions for future research are discussed. 


2 Conceptual Model 


2.1 Background 


At the telecommunication system level, Ericson has proposed a reference model
consisting of five parts: terminals, access-, transport-, network management-, and
network intelligence part [8]. We have extended this reference model by distinguishing
between the telecommunication system and the telecommunication network, and
by applying the present ITU-T terminology (Fig. 1). The extended model contains seven parts (subsystems):
(1) Network Environment (natural-, technological-, and socio-economic environment);
(2) Users; (3) Subscribers/Customers¹; (4) Terminals; (5) Telecommunication Network;
(6) Network’s Information Servers (network intelligence); and (7) Telecommunication
Administration (network service provider). The interaction between subsystems (if any)
is presented by a common border between their representing rectangles in Fig. 1. Each
subsystem is part of the environment (context) of the other subsystems.


[Figure: block diagram of the subsystems. Legend (block shading): detailed interactive behavior description; indirect (equipment characteristics and interruption probabilities); target values and restrictions; not considered in the low-level models.]

Fig. 1. A reference model of an overall telecommunication system and its environment (an
extension of [9]).


¹ According to [1], the user is “A person or entity external to the network, which utilizes connections
through the network for communication”, whereas the customer is “A user who is
responsible for payment for the services”.
For designing and managing telecommunication systems one needs scalable models
in all aspects of the term ‘scalability’: “scale down: make smaller in proportion; reduce
in size”; “scale up: make larger in proportion; increase in size”; “to scale: with a uniform
reduction or enlargement” [10]. Models’ scalability includes temporal-, spatial-, structural-,
parametric-, conceptual-, functional-, and other scalabilities.


Basic Virtual Devices: At the bottom of the structural model presentation, we consider 
‘basic virtual devices’ that do not contain any other virtual devices. A basic virtual device 
has the graphic representation as shown in Fig. 2. 


[Figure: device x with its external flows and parameter names Fx, Px, Tx, Nx, Yx, Vx.]

Fig. 2. A graphical representation of a basic virtual device x.


Parameters of the basic virtual device x are the following (cf. [11] for terms definition):

Fx – Intensity or incoming rate (frequency) of the flow of requests (i.e. the number of
requests per time unit) to device x;

Px – Probability of directing the requests towards device x;

Tx – Service time (duration of servicing of a request) in device x;

Yx – Traffic intensity [Erlang];

Vx – Traffic volume [Erlang · time unit];

Nx – Number of lines (service resources, positions, capacity) of device x.
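
The list does not spell out how these parameters relate to each other. A minimal sketch, assuming the standard teletraffic relations for a stationary state (traffic intensity as the product of flow intensity and mean service time, i.e. Little's formula, and traffic volume as intensity accumulated over an observation interval), could look as follows. The function names and the numbers in the usage note are hypothetical.

```python
def traffic_intensity(flow_per_s, service_time_s):
    """Traffic intensity Yx in Erlang, assuming Yx = Fx Tx in a stationary state."""
    return flow_per_s * service_time_s


def traffic_volume(intensity_erl, interval_s):
    """Traffic volume Vx in Erlang-seconds over an observation interval."""
    return intensity_erl * interval_s
```

For example, 0.5 requests per second with a mean service time of 120 s would give 60 Erlang, and 216000 Erlang-seconds over one hour of observation.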


Functional Normalization: In our models, we consider monofunctional idealized basic 
virtual devices of the following types (Fig. 3): 


• Generator – this device generates calls (service requests, transactions);

• Terminator – this block eliminates every request entered (so it leaves the model
without any traces);

• Modifier – this device changes the intensity of the incoming flow, creating or nullifying
requests. It is used to model the input flow, in conformance with the system
status (cf. Fig. 7);

• Copier – this block creates copies of the requests received and directs them to a route
different from the original one;

• Director – this device unconditionally points to the next device, which the request
shall enter, but without transferring or delaying it;

• Enter Switch – this block checks if there is a free resource/place in the next block for
a request to be accommodated in: if yes, the request is passed to it without delay; if
not, the request is re-directed to another device;

• Server – this device models the delay (service time, holding time) of requests in the
corresponding device without their generation or elimination. It also models traffic
and time characteristics of the requests processing (cf. Fig. 2);

• Transition – this device selects one of its possible exits for each request entered, thus
determining the next device where this request shall go to;

• Graphic Connector – this is used to simplify the graphical representation of the
conceptual model structure. It has no modeling functions.


[Figure: block symbols for Generator, Terminator, Modifier, Copier, Director, Server, Transition, Enter Switch, and Graphic Connector.]

Fig. 3. A graphical block representation of the main basic virtual mono-functional devices used.


Structural Normalization: Following the theorem of Böhm and Jacopini [12], we use
basic virtual devices mainly with one entrance and one exit. Exceptions are: the transi- 
tion device, which in our structural normalization has one entrance and two exits (for 
splitting the requests’ flows) or two entrances and one exit (for merging the requests’ 
flows); and the copier with its one entrance and two exits. 


Causal Structure Presentation: Any service may end due to many reasons. In a tele- 
communication network, all reasons are classified into four types: network failures, user 
failures (ineffective calls associated with the callers and callees), network service 
provider failures, and successful ending (completed seizures) [13, 14]. The ‘cause value’ 
field in [14] contains 99 items. In [13], there are 127 ‘cause value’ numbers. Cisco lists
131 ‘call termination cause codes’ and 44 ‘Cisco-specific call termination cause 
codes’ [15]. 


Complex Virtual Devices: Each reason for service ending has its own probability to
occur and mean service time (duration). In our conceptual model, the service execution
goes through different stages (e.g. dialing, switching, ringing, etc.), each consisting of
different phases. Each stage of a modeled service corresponds to one (or more) complex
virtual device and contains ‘service branches’ (service phases). Typically, a service
phase includes a service device and all necessary auxiliary devices such as queues, entry
and exit devices, as well as virtual devices reflecting the user behavior, associated with
this phase, e.g. the waiting time before initiating a repeated call attempt. Each service
branch corresponds to a different reason of service ending. The service branches form
the ‘causal structure’ of the modeled service. The causal structure of a complex virtual
device x (with input requests’ flow frequency Fx, mean service time Tx, and traffic intensity
Yx) could be presented in two ways: by using a normalized structure or a pie structure
(Fig. 4).


Fig. 4. (a) A complex virtual device x, representing a service with k reasons for ending; (b) the
normalized causal structure of device x; (c) the pie causal structure of device x.


Both structures include k virtual ‘causal devices’, each with its own mean input
requests’ flow frequency $F_i$, mean service time $T_i$, and traffic intensity $Y_i$. Obviously:

$$Y_x = \sum_{i=1}^{k} Y_i. \qquad (1)$$


The difference between the two presentations is in the internal flow structures only.
In the pie causal structure (Fig. 4c), all causal service branches have a common beginning.
The probability $P_{p,i}$ shows what part (pie) of the service incoming flow is directed to the
causal device i. All probabilities $P_{p,i}$ are dependent:

$$\sum_{i=1}^{k} P_{p,i} = 1. \qquad (2)$$


In the normalized causal structure (Fig. 4b), all service branches are ordered consecutively
as derivations of one ‘successful completed service branch’. The probability $P_{n,i}$
shows what part of the flow, already passed through the previous causal branches, is
derived to the considered service case (causal device) i. The probabilities $P_{n,i}$ are independent
(orthogonal, normal). The order of causal branches does not matter (has no
mathematical meaning) but usually the branch of successful completion of the service
($P_{n,k}$) is the last one.

Both structures lead to different presentations of the same QoS indicators. For 
example, the probability (resp. efficiency Ec) for successful completion of the service 
in the normalized (3) and pie presentation (4) is respectively: 


$$E_c = \prod_{i=1}^{k-1} \left(1 - P_{n,i}\right). \qquad (3)$$

$$E_c = 1 - \sum_{i=1}^{k-1} P_{p,i}. \qquad (4)$$


The normalized and pie structures are used by many authors but usually without
these associated names, and without discussions about the nature of parameters and how
one structure could be converted to the other. For example, in [16] expressions like (2),
(3) and (4) are classified as ‘aggregation functions’, where (2) is additive, (3) is multiplicative,
and (4) is not specified.

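The conversion between the values of the normalized and pie probabilities (and
vice versa) could be done by means of the following system of k equations with k
variables ($P_{p,j}$ or $P_{n,j}$, $j = 1, 2, 3, \ldots, k$):

$$P_{p,j} = P_{n,j}, \quad \text{if } j = 1;$$

$$P_{p,j} = P_{n,j} \prod_{i=1}^{j-1} \left(1 - P_{n,i}\right), \quad \text{if } j = 2, \ldots, k.$$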

Each structure has advantages over the other. The normalized structure allows clearer 
conceptual presentation and simpler inference of the analytical models, but normalized 
probabilities depend on the causal branch positions. The pie structure is more natural 
and impressive in business presentations (pie charts, pie graphs). Each structure is a 
mathematical equivalent of the other. Both allow for model scalability. 
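
The equivalence of the two structures can be checked numerically. The sketch below reads the conversion as follows: the pie share of branch j is its normalized probability multiplied by the part of the flow that has passed all previous branches. It assumes that the last normalized probability equals 1 (all remaining flow ends in the last branch); the function names and the example values are hypothetical.

```python
import math


def normalized_to_pie(p_n):
    """Convert normalized (conditional) probabilities to pie shares."""
    pie = []
    surviving = 1.0  # part of the flow that passed all previous branches
    for p in p_n:
        pie.append(surviving * p)
        surviving *= 1.0 - p
    return pie


def pie_to_normalized(p_p):
    """Convert pie shares back to normalized (conditional) probabilities."""
    normalized = []
    surviving = 1.0
    for p in p_p:
        # Guard against division by (numerically) zero remaining flow.
        normalized.append(p / surviving if surviving > 1e-12 else 0.0)
        surviving -= p
    return normalized


# Hypothetical example with k = 3 branches; the last branch is the successful one.
p_n = [0.1, 0.2, 1.0]
p_p = normalized_to_pie(p_n)

# Efficiency computed in both presentations.
ec_normalized = math.prod(1.0 - p for p in p_n[:-1])
ec_pie = 1.0 - sum(p_p[:-1])
```

In this example both presentations give an efficiency of about 0.72, and converting the pie shares back reproduces the normalized probabilities up to rounding.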


2.2 Conceptual Model 


We consider a virtual overall telecommunication system including users, terminals and
possibly several telecommunication networks, operated by different operators. We
consider a VNET carrying Class 0 traffic (real-time, jitter-sensitive, with high interaction,
e.g. Voice over IP (VoIP) and video teleconference) [17]. The VNET utilizes virtual channel
switching principles, following the main method for traffic QoS guarantees, namely resource
reservation [18]: “Bandwidth reservation is recommended and is critical to the stable
and efficient performance of Traffic Engineering methods in a network, and to ensure
the proper operation of multiservice bandwidth allocation, protection, and priority treatment.”

In our approach, the overall network QoS parameters are an aggregation of all end-to-end
QoS parameters of all terminals and connections in the network, within the considered
time interval (Fig. 5).

The VNET in Fig. 5 includes also users, not just the terminals, and generalizes call 
intensity, time- and traffic parameters of the calling (A), called (B) and all active (AB) 
terminals, as well as of the overall network equivalent switching lines, reflecting 
resources of all comprised telecommunication networks. 

In this chapter, we propose a considerable extension of the conceptual and analytical 
performance models of the overall telecommunication system with QoS guarantees, 
described in [6]. This includes two new service branches corresponding to the cases of 
‘called party being busy with another call’ and ‘mailing a message’. 
[Figure: A-terminals (Fa, Ta, Ya) and B-terminals (Fb, Tb, Yb) within the overall terminal subsystem (Nab, Fo, Tab, Yab); overall network equivalent switching lines (Ns, Fs, Ts, Ys); overall network base QoS output parameters considered: Pbs, Pbr.]

Fig. 5. A generalized VNET, including users and terminals, with overall QoS guarantees (a
modification of [19]).


Basic Virtual Devices’ Name Notation. In the normalized conceptual model, each
virtual device has a unique name, depending on its position and the role it plays in the
model (Figs. 6, 7, 8, 9 and 10).


Virtual Device Name = <BRANCH EXIT><BRANCH><STAGE>

BRANCH EXIT:                 BRANCH:               STAGE:
r = repeated;                e = entered;          d = dialing;
t = terminated (not          a = abandoned;        s = switching;
    considered usually).     b = blocked;          r = ringing;
                             i = interrupted;      h = holding;
                             n = not available;    c = communication;
                             c = carried.          m = mailing (voice).

Fig. 6. The basic virtual devices’ name notation.


The model is partitioned into service stages (dialing, switching, ringing, holding, 
communication, and mailing). 

Each service stage has different branches (entered, abandoned, blocked, interrupted, 
not available, carried), corresponding to the modeled possible cases of ending the 
service. 

Each branch has two exits (repeated, terminated) that show what happens with the 
service request after it enters the telecommunication system. Users may make a new bid 
(repeated service request) or may stop attempting (terminated service request). 

In the virtual devices’ name notation, the corresponding first letters of the name of 
the branch exit, the branch, and the service stage are used (in this order) to form the 
name of the virtual device: 


Virtual Device Name = <BRANCH EXIT><BRANCH><STAGE>
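
The naming rule can be sketched as a small decoder. The branch exit and branch letters follow the notation described above; the stage letters are assumed to be the first letters of the stage names (d, s, r, h, c, m), in line with the device names used in this chapter such as ed, cs, br, and rbs. The function name is hypothetical, and two-letter names are treated as devices without a branch exit.

```python
BRANCH_EXITS = {"r": "repeated", "t": "terminated"}
BRANCHES = {
    "e": "entered",
    "a": "abandoned",
    "b": "blocked",
    "i": "interrupted",
    "n": "not available",
    "c": "carried",
}
# Assumption: each stage is abbreviated by its first letter.
STAGES = {
    "d": "dialing",
    "s": "switching",
    "r": "ringing",
    "h": "holding",
    "c": "communication",
    "m": "mailing",
}


def decode_device_name(name):
    """Split a basic virtual device name into (branch exit, branch, stage).

    The letters are positional: 'r' means 'repeated' as a branch exit
    but 'ringing' as a stage.
    """
    if len(name) == 3:
        exit_letter, branch_letter, stage_letter = name
        branch_exit = BRANCH_EXITS[exit_letter]
    elif len(name) == 2:
        branch_letter, stage_letter = name
        branch_exit = None
    else:
        raise ValueError(f"unexpected device name: {name!r}")
    return branch_exit, BRANCHES[branch_letter], STAGES[stage_letter]
```

For example, `decode_device_name("rbs")` yields the parts of ‘repeated blocked switching’, and `decode_device_name("cs")` yields ‘carried switching’ with no branch exit.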


Complex Virtual Devices’ Names. We use the following complex virtual devices (i.e.
devices consisting of several basic virtual devices):

• a – a virtual device that comprises all A-terminals (i.e. the calling terminals) in the
system. The a device is represented as a ‘dotted line’ box, named a0 in Fig. 7, a1 in
Fig. 8, a2 in Fig. 9, and a3 in Fig. 10;

• b – a virtual device that comprises all B-terminals (i.e. the called terminals) in the
system. The b device is represented as a ‘dashed line’ box, corresponding to the B-terminal
load, in Figs. 8, 9, and 10;

• ab – this device comprises all the active (i.e. calling and called) terminals in the system;

• s – a virtual device corresponding to the equivalent connection lines in the switching
system. It is represented as a ‘dotted and dashed line’ box, named s, inside the a0 box
in Fig. 7, and other a boxes (a1 in Fig. 8, a2 in Fig. 9, and a3 in Fig. 10).


The network environment includes also basic virtual devices outside the a and b 
complex devices. Service requests in the environment do not occupy network devices, 
but rather form incoming flows out of demand and repeated call attempts. 


Fig. 7. Service stages ‘Dialing’, ‘Switching’, and the beginning of stage ‘Ringing’.


In Fig. 7, Fo is the intent intensity of calls², with a Poisson distribution, generated
by a terminal; dem.Fa³ is the intensity of demand (first, primary calls), generated by all
A-terminals, according to the BPP-traffic model (cf. the modifier block in Fig. 7); M is a
constant. In our approach, every value of M within the interval [−1, +1] is allowed. If
M = −1, the intensity of the demand flow corresponds to the Bernoulli (Engset) distribution;
if M = 0, to the Poisson (Erlang) distribution; and if M = +1, to the Pascal
(Negative Binomial) distribution.


² In this chapter, the term ‘call’ means ‘service request’, ‘call attempt’ or ‘bid’ according to the
terminology in [11].

³ In the expressions, formulas and figures, the sign (.) is used only as a separator and NOT as a
sign of multiplication. The multiplication operation is indicated by a gap between multiplied
variables, e.g. X Y.
rep.Fa stands for repeated attempts, generated by A-users and A-terminals, in the
case of unsuccessful call attempts; Fa is the flow generated by and occupying the A-terminals
(it is a sum of the intensities of primary (demand) call attempts (dem.Fa) and
repeated attempts (rep.Fa)).

Devices ‘entered dialing’ (ed), ‘carried dialing’ (cd), and ‘carried switching’ (cs), 
belong to the successful service branch. 

Devices ‘abandoned dialing’ (ad), ‘interrupted dialing’ (id), ‘blocked switching’
(bs), ‘interrupted switching’ (is), ‘not available switching (service, number)’ (ns), and
‘blocked ringing’ (br) belong to the unsuccessful (due to different reasons) service
branches. They reflect durations of the corresponding signaling, e.g. the ‘busy tone’
duration.

Devices ‘repeated abandoned dialing’ (rad), ‘repeated interrupted dialing’ (rid),
‘repeated blocked switching’ (rbs), ‘repeated interrupted switching’ (ris), ‘repeated not
available switching (service, number)’ (rns), and ‘repeated blocked ringing’ (rbr)
correspond to the time that users’ requests wait, outside the network equipment, before
the next repeated call attempt.

The device of type ‘Enter Switch’ (just before the ‘blocked switching’ (bs) device
in Fig. 7) deflects calls if there is no free line in the switching system, with probability 
of blocked switching (Pbs). The second ‘Enter Switch’ device (after the block ‘carried 
switching’ (cs) in Fig. 7) deflects calls, with probability of blocked ringing (Pbr), if the 
called B-terminal is busy. 

Note that there is no B-terminal traffic in the part of the conceptual model, presented 
in Fig. 7. 




Fig. 8. Service stages ‘Ringing’ (end), ‘Communication’, and ‘B-terminal Load’ (Case 1). 


Figure 8 presents the call flows, in Case 1, when the B-terminal is found free (c.f. 
Connector 1 in Figs. 7 and 8). In this case, the flow intensity, occupying the B-terminal, 
is generated by the Copy device (c.f. Connector B1), because at the beginning of the 
ringing stage, the B-terminal becomes busy. In Case 1, the traffic load on the A-terminal 
equals the traffic load on the B-terminal. 

Fig. 9. Service stages ‘Holding’, ‘Communication’, and ‘B-terminal Load’ (Case 2).


Figure 9 presents the call flows, in Case 2, when the B-terminal is found busy (c.f. 
Connector 2 in Figs. 7 and 9). This is the case of call holding — the A-user is put to wait 
(virtual devices ‘carried holding’ (ch) and ‘abandoned holding’ (ah)). In pure voice 
communication systems, in this case, a pre-recorded music/message is usually played 
to the caller while waiting. The connection is not terminated but no verbal communi- 
cation is possible. At the same time the B-user is notified (by a sound and/or light indi- 
cation on his/her terminal/phone) that another call is trying to reach him/her, with the 
options of answering (virtual device ‘carried holding’ (ch)) or not answering it (virtual
device ‘abandoned holding’ (ah)). During the hold time, the B-user is able to continue 
with or answer another call, retrieve a waiting call, etc. Note that in this case, traffic 
loads on the A- and B-terminals are considerably different. 




Fig. 10. Service stages ‘Mailing’, ‘Communication’, and ‘B-terminal Load’ (Case 3). 


Figure 10 presents the call flows, in Case 3, when the B-terminal is found busy (c.f. 
Connector 3 in Figs. 7 and 10). This is the case when the A-user is redirected to a mail 
service to leave an audio message. In some systems, there is also a possibility to leave 
a video message, e.g. a visual voicemail. The A-user receives an invitation to leave a 
mail message (virtual device ‘enter mailing’ (em)) and may decide to use this service 
(virtual device ‘carried mailing’ (cm)) or to abandon the service (virtual device ‘aban- 
doned mailing’ (am)). The message is retrieved (later) by the B-user either as audio 
directly from his/her terminal/phone or from another device via a web link supplied by 
an email message, or as a text by utilizing a voicemail-to-text functionality. This message 
retrieval is reflected by the case of using the B-terminal by the B-user in our conceptual 
model. 

Parameters’ Qualification. In Fig. 7, one may see the notations ‘Fa’, ‘dem.Fa’, and
‘rep.Fa’, which use the qualifiers dem and rep. Traffic qualification is necessary and is used in
[11], but without any attempt to include the qualifiers in the parameters’ names. The
problem is more complex: (1) one would like to have the same, or very similar, parameters’
names in the conceptual, analytical, and computer models; (2) one would like to
meet the Name Design Criteria: “Names with which human beings deal directly should
be user-friendly. A user-friendly name is one that takes the human user’s point of view,
not the computer’s. It is one that is easy for people to deduce, remember and understand,
rather than one that is easy for computers to interpret.” [20], Annex J: “Name Design
Criteria”.

Since 2006 [6], we have used up to two qualifiers as part of a parameter’s name. The
first is for the origin of the parameter’s value, e.g. emp for ‘empirical’, dsn for ‘designed’, trg
for ‘target’, etc. The second qualifier characterizes the traffic. Most of the traffic qualifiers
are described in [11]. In this chapter, we use dem for ‘demand’, rep for ‘repeated’, ofr for
‘offered’, and crr for ‘carried’. We extend the meaning of the traffic qualifiers to the
other parameters determining the traffic, e.g. in our notation, ofr.Ys = ofr.Fs srv.Ts
means: ‘the offered traffic intensity to the switching system is the product of the offered
requests’ frequency (rate) and the service time in the switching system’.

The definition of the offered traffic needs further explanation. There are two offered
traffic definitions in the ITU-T recommendations: (1) Equivalent Traffic Offered [21]; 
and (2) Traffic Offered [11]. In the other standardization documents, there is only one 
offered traffic definition, close to the Equivalent Traffic Offered [21]. In the overall 
network performance models, both definitions give considerably different values [22]. 
In this chapter, we use only the definition of the Equivalent Traffic Offered [21]. 


2.3 QoS Prediction Task Formulation 


We consider the conceptual model presented in Figs. 6, 7, 8, 9, 10 and described in 
Sect. 2.2. In this chapter, we consider that the overall telecommunication system 
provides four services: (1) finding B-terminal; (2) connection to B-terminal; (3) finding 
B-user (with sound, vibration, message, etc.); and (4) transmission and/or record of 
messages. The quality of these services depends on many subsystems (c.f. Fig. 1),
including the user- and network behavior. 

Types of Parameters. There are two types of parameters: static and dynamic. The
10 basic dynamic parameters (with values dependent on the system state) are: Fo, Yab,
Fa, dem.Fa, rep.Fa, Pbs, Pbr, ofr.Fs, Ts, and ofr.Ys. All other dynamic parameters can
be obtained from these.

Note that the traffic Yab from all terminals is accepted as a system macro-state 
parameter. 

Input Parameters. These are mostly static, i.e. related to the network technical
characteristics or the user behavior. We choose one dynamic parameter, Fo (the intent
intensity of calls of one idle terminal), as an independent input variable. The proposed
analytical model allows all dynamic values to be found if Fo and all static parameters are
known.

The probability of finding the B-user is considered static (i.e. independent of the
system state).

The basic QoS output parameters are:


• Quality of the ‘finding B-terminal’ service, represented by the probability of call
  blocking due to unavailable network equipment (equivalent network switching lines):
  blocked switching (Pbs);

• Quality of connection to the B-terminal, represented by the probability of call
  blocking due to a busy B-terminal: blocked ringing (Pbr).


These two parameters allow determination of many other QoS indicators, related to 
traffic-, time-, and flow characteristics of users and terminals. 

The goal of this section is to find analytically all unknown basic dynamic parameters, 
including the basic QoS output parameters. 


2.4 Main Assumptions 


For a clear analytical modeling of a telecommunication system with QoS guarantees, 
the following assumptions were made: 


Assumption 1. A closed service system, presented in Figs. 6, 7, 8, 9 and 10, is 
considered; 


Assumption 2 (Capacity of Devices). The switching system (s) has a capacity of Ns
connections (every virtual internal switching line may carry only one call attempt).
Complex devices have limited capacity: the capacity of the ab device is Nab ∈ [2, ∞)
terminals; every terminal may engage in only one call (incoming or outgoing) at a time;
all basic virtual devices have unlimited capacity;


Assumption 3 (Occupation of A-terminals). Every incoming call attempt (Fa), from 
the environment, falls only on a free A-terminal. This terminal becomes a busy one; 


Assumption 4 (Steady State). Every device is in a stationary state. Hence Little’s
theorem [23] is applicable to each device: Y = F T;


Assumption 5 (Capacity of Call Attempts). Every call attempt may occupy no more 
than one place, if any, in each basic virtual device; 


Assumption 6 (Network Environment). The calls and devices in the environment
(outside blocks a and b in Figs. 7, 8, 9 and 10) form the intent and repeated call flows.
They do not create load on the telecommunication network;


Assumption 7 (Device Independence). Excluding the dependences described in the 
mathematical model, all parameters of a virtual device are independent from the param- 
eters’ values of any other virtual device in the model; 


Assumption 8 (Randomness of the Processes). All variables in the analytical model
are considered random with a fixed distribution; Little’s theorem is used for working
with their mean values.


Assumption 9 (A and B Simultaneity). If a call attempt is served in corresponding 
virtual devices belonging to A- and B-terminal’s load (e.g. ar, cr, ac, cc1, cc2, cm in
Figs. 7, 8, 9 and 10), it seizes and releases them simultaneously, with the same service 
load and duration. 


Assumption 10 (Virtual Channel Switching). Every call attempt occupies simulta- 
neously places in all the basic virtual devices of the complex device a or b it is passing 
through, including the basic device where it is at the moment of observation. Every 
call attempt releases all occupied places at the very moment it leaves the complex 
device a or b. 


Assumption 11 (Homogeneity⁴). All terminals and users are homogeneous.


Assumption 12 (Self-Excluding). Every A-terminal directs, with uniform distribution, 
all its call attempts to other terminals, not to itself; 


Assumption 13 (B-flow). The flow of call attempts, occupying B-terminals (Fb), is 
ordinary. (The case when two or more call attempts reach simultaneously a free B- 
terminal is not considered, due to its statistical unimportance); 


Assumption 14 (B-terminal Busy Probability). The stationary probability of a call to 
find the intended B-terminal busy (‘blocked ringing’ (Pbr)) during the first (primary, 
demand) attempt and all subsequent (repeated) attempts is one and the same. 


3 Analytical Model 


3.1 Overall Input Flow Intensity 


The input (incoming) flow to the telecommunication network, with intensity Fa, is the
flow generated by (and occupying) A-terminals. From the ITU E.600 definitions and
Fig. 7, it follows that the intensity of the incoming flow is the sum of the intensities of
primary (demand) call attempts (dem.Fa) and repeated attempts (rep.Fa):


Fa = dem.Fa + rep.Fa. (6)


From the definition of the BPP-flow and Fig. 7 we have:


dem.Fa = Fo (Nab + M Yab). (7)
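
As an illustration of (6) and (7), the following minimal Python sketch evaluates the demand intensity for the three named values of M. The function names and the numerical values of Fo, Nab and Yab are hypothetical and are chosen only to show the effect of M; they are not taken from the chapter's data.

```python
def demand_intensity(fo: float, nab: int, yab: float, m: float = 0.0) -> float:
    """Eq. (7): dem.Fa = Fo (Nab + M Yab)."""
    if not -1.0 <= m <= 1.0:
        raise ValueError("M must lie in the interval [-1, +1]")
    return fo * (nab + m * yab)


def incoming_intensity(dem_fa: float, rep_fa: float) -> float:
    """Eq. (6): Fa = dem.Fa + rep.Fa."""
    return dem_fa + rep_fa


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    fo, nab, yab = 0.01, 1000, 200.0
    cases = ((-1, "Bernoulli (Engset)"), (0, "Poisson (Erlang)"), (1, "Pascal"))
    for m, name in cases:
        dem_fa = demand_intensity(fo, nab, yab, m)
        print(f"M = {m:+d} ({name}): dem.Fa = {dem_fa:.2f}")
```

With M = -1 only the idle terminals (Nab - Yab) generate demand, with M = 0 the demand does not depend on the network state, and with M = +1 it grows with the traffic Yab.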


⁴ Homogeneity means that all relevant characteristics and their considered mean values are the
same.


3.2 QoS Indicator 1: Carried Switching Efficiency


According to Definition 2.11 in [11]: “fully routed call attempt; successful call attempt” 
is “A successful call attempt that receives an answer signal”. We define the Carried 
Switching Efficiency of the ‘Finding B-Terminal’ service as a ratio of the flow intensity 
of the calls reaching the intended B-terminal (Fcs) and receiving an answer signal ‘busy 
tone’ or ‘ringing tone’, to the incoming call attempts intensity (Fa). 

The Carried Switching Efficiency corresponds to the concept of “answer bid ratio 
(ABR)” in [11]: “On a route or a destination code basis and during a specified time 
interval, the ratio of the number of bids that result in an answer signal, to the total number 
of bids.” 

In the conceptual model considered (c.f. Fig. 7), the calls served in the device ‘carried
switching’ (Fcs) are those reaching the B-terminals. The intensity Fcs may be calculated
by taking into account Fa and the losses on the way to the cs device (c.f. Fig. 7). This can be
expressed in two ways, by using the lost call flows or the probabilities of requests moving
successfully along the successful branch, and results in the following:


Fcs = Fa (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis)(1 - Pns). (8)


So, the Carried Switching Efficiency (Ecs) of the ‘Finding B-Terminal’ service is:


Ecs = Fcs / Fa = (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis)(1 - Pns). (9)
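
The product form of (9) can be sketched in a few lines of Python. The function name and the sample probabilities below are assumptions made for illustration only.

```python
from math import prod


def carried_switching_efficiency(pad: float, pid: float, pbs: float,
                                 pis: float, pns: float) -> float:
    """Eq. (9): Ecs = (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis)(1 - Pns)."""
    losses = (pad, pid, pbs, pis, pns)
    if any(not 0.0 <= p <= 1.0 for p in losses):
        raise ValueError("all arguments must be probabilities in [0, 1]")
    return prod(1.0 - p for p in losses)


if __name__ == "__main__":
    # Hypothetical loss probabilities along the successful branch.
    ecs = carried_switching_efficiency(pad=0.1, pid=0.0, pbs=0.5, pis=0.0, pns=0.0)
    fa = 20.0                      # hypothetical incoming intensity Fa
    print("Ecs =", ecs)            # share of attempts reaching the cs device
    print("Fcs =", fa * ecs)       # Eq. (8): Fcs = Fa Ecs
```

Each factor is the probability of passing one loss point on the successful branch, so a single blocked stage (for example Pbs = 1) drives the whole efficiency to zero.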


3.3 Repeated Calls Flow 


Based on the repeated calls definition [21] and the proposed conceptual model (Figs. 7, 
8, 9 and 10), the intensity of the repeated attempts (rep.Fa) is: 


rep.Fa = Frad + Frid + Frbs + Fris + Frns + Frbr + Fr1 + Fr2 + Fr3, (10)


where Fr1 = Frar + Frac + Frcc1 is the intensity of repeated attempts in Case 1,
directed to Connector 4 (c.f. Fig. 8); Fr2 = Frah + Frac + Frcc2 is the intensity of
repeated attempts in Case 2, directed to Connector 4 (c.f. Fig. 9); Fr3 = Fram + Frcm
is the intensity of repeated attempts in Case 3, directed to Connector 4 (c.f. Fig. 10).


Proposition 1. The intensity of the repeated attempts rep.Fa may be obtained as: 


rep.Fa = Fa (Pad Prad + (1 - Pad)(Pid Prid + (1 - Pid)(Pbs Prbs
+ (1 - Pbs)(Pis Pris + (1 - Pis)(Pns Prns + (1 - Pns)(Pbr (Ph Pr2
+ (1 - Ph)(Pm Pr3 + (1 - Pm) Prbr)) + (1 - Pbr) Pr1)))))), (11)


where Ph (‘holding’) is the probability of calls going to Case 2 (c.f. Connector 2 in
Fig. 7), Pm (‘mailing’) is the probability of calls going to Case 3 (c.f. Connector 3 in
Fig. 7), and:

Pr1 = Ecs (Pah Prah + (1 - Pah)(Pac Prac + (1 - Pac) Prcc1)); (12)
Pr2 = Ecs Pbr Ph (1 - Par)(Pah Prah + (1 - Pah)(Pac Prac + (1 - Pac) Prcc2)); (13)
Pr3 = Ecs Pbr (1 - Ph) Pm (Pam Pram + (1 - Pam) Prcm). (14)


Proof: As can be seen from Figs. 7, 8, 9 and 10, Assumption 1, and (10), rep.Fa is a 
sum of intensities of repeated attempt flows, in all branches. The intensities of repeated 
attempt flows, in all branches, may be easily expressed as functions of Fa, following the 
conceptual model structure depicted in Figs. 7, 8, 9 and 10: 


Frad = Fa Pad Prad; (15)
Frid = Fa (1 - Pad) Pid Prid; (16)
Frbs = Fa (1 - Pad)(1 - Pid) Pbs Prbs; (17)
Fris = Fa (1 - Pad)(1 - Pid)(1 - Pbs) Pis Pris; (18)
Frns = Fa (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis) Pns Prns; (19)
Frbr = Fa (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis)(1 - Pns) Pbr Prbr; (20)
Fr1 = Fa (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis)(1 - Pns)(1 - Pbr)(Pah Prah
+ (1 - Pah)(Pac Prac + (1 - Pac) Prcc1)) = Fa Pr1; (21)
Fr2 = Fa (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis)(1 - Pns) Pbr Ph (1 - Par)
(Pah Prah + (1 - Pah)(Pac Prac + (1 - Pac) Prcc2)) = Fa Pr2; (22)
Fr3 = Fa (1 - Pad)(1 - Pid)(1 - Pbs)(1 - Pis)(1 - Pns) Pbr (1 - Ph) Pm
(Pam Pram + (1 - Pam) Prcm) = Fa Pr3. (23)


By adding Eqs. (15) to (23) and taking into account (10), we obtain (11).


Proposition 2. By distinguishing static and dynamic parameters in (11), and after some
algebraic operations, we obtain rep.Fa as a simple function of Fa, Pbr, and Pbs:


rep.Fa = Fa (R1 + R2 Pbr (1 - Pbs) + R3 Pbs), (24)


where:


R1 = Pad Prad + (1 - Pad)(Pid Prid + (1 - Pid)(Pis Pris
+ (1 - Pis)(Pns Prns + (1 - Pns) Pr1))); (25)


R2 = (1 - Pad)(1 - Pid)(1 - Pis)(1 - Pns)(Ph Pr2 + (1 - Ph)(Pm Pr3
+ (1 - Pm) Prbr) - Pr1); (26)


R3 = (1 - Pad)(1 - Pid)(Prbs - (Pis Pris + (1 - Pis)(Pns Prns + (1 - Pns) Pr1))). (27)
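
The algebra behind Proposition 2 can be checked numerically. The sketch below is only an illustration: it reads the nested form of (11) with the factor (1 - Pns) applied to both the busy and the free B-terminal terms (the reading that agrees with (25)-(27)), treats Pr1, Pr2 and Pr3 as given numbers, and uses hypothetical dictionary keys named after the chapter's symbols.

```python
import random


def rep_ratio_nested(p: dict) -> float:
    """rep.Fa / Fa following the nested structure of Eq. (11)."""
    busy = p["Ph"] * p["Pr2"] + (1 - p["Ph"]) * (
        p["Pm"] * p["Pr3"] + (1 - p["Pm"]) * p["Prbr"])
    at_b = p["Pbr"] * busy + (1 - p["Pbr"]) * p["Pr1"]
    at_ns = p["Pns"] * p["Prns"] + (1 - p["Pns"]) * at_b
    at_is = p["Pis"] * p["Pris"] + (1 - p["Pis"]) * at_ns
    at_bs = p["Pbs"] * p["Prbs"] + (1 - p["Pbs"]) * at_is
    at_id = p["Pid"] * p["Prid"] + (1 - p["Pid"]) * at_bs
    return p["Pad"] * p["Prad"] + (1 - p["Pad"]) * at_id


def rep_ratio_static(p: dict) -> float:
    """rep.Fa / Fa from Eq. (24) with R1, R2, R3 as in (25)-(27)."""
    tail = p["Pis"] * p["Pris"] + (1 - p["Pis"]) * (
        p["Pns"] * p["Prns"] + (1 - p["Pns"]) * p["Pr1"])
    a = (1 - p["Pad"]) * (1 - p["Pid"])
    busy = p["Ph"] * p["Pr2"] + (1 - p["Ph"]) * (
        p["Pm"] * p["Pr3"] + (1 - p["Pm"]) * p["Prbr"])
    r1 = p["Pad"] * p["Prad"] + (1 - p["Pad"]) * (
        p["Pid"] * p["Prid"] + (1 - p["Pid"]) * tail)
    r2 = a * (1 - p["Pis"]) * (1 - p["Pns"]) * (busy - p["Pr1"])
    r3 = a * (p["Prbs"] - tail)
    return r1 + r2 * p["Pbr"] * (1 - p["Pbs"]) + r3 * p["Pbs"]


KEYS = ("Pad Prad Pid Prid Pbs Prbs Pis Pris Pns Prns "
        "Pbr Prbr Ph Pm Pr1 Pr2 Pr3").split()

if __name__ == "__main__":
    rng = random.Random(1)
    sample = {k: rng.random() for k in KEYS}   # arbitrary test point
    print(abs(rep_ratio_nested(sample) - rep_ratio_static(sample)) < 1e-12)
```

The separation is useful because R1, R2 and R3 contain only static parameters, so they can be computed once while Pbr and Pbs vary with the network state.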


3.4 QoS Indicator 2: B-Terminal Connection Efficiency


Definition 2.10 in [11] describes “completed call attempt; effective call attempt” as “A call
attempt that receives intelligible information about the state of the called user”.

Based on this, we define the B-Terminal Connection Efficiency as a ratio of the flow 
intensity of the calls occupying the intended B-terminal (Fb) to the incoming call 
attempts’ intensity (Fa). 

In the considered conceptual model, the calls occupying the B-terminal receive infor- 
mation about the state of the called B-user such as signals ‘ringing tone’ (Case 1 in Fig. 8), 
‘holding signal’ (Case 2 in Fig. 9), or ‘invitation to mailing’ signal (Case 3 in Fig. 10). The 
A-user may accept (devices cr, ch, cm) or reject (devices ar, ah, am) the offers. 


3.5  B-Terminals’ Characteristics 


The intensity of the input flow occupying all B-terminals (Fb) is the sum of the following
intensities of input flows (to B-terminals): Fb1, in Case 1, Ringing stage (generated in
the copy device in Fig. 8); Fb2, in Case 2, Communication stage (generated in the copy
device in Fig. 9); and Fb3, in Case 3, Communication stage (generated in the copy
device in Fig. 10), or:


Fb = Fb1 + Fb2 + Fb3. (28) 


The flow intensities Fb1, Fb2 and Fb3 can be calculated by considering the intensity 
of the carried switching flow Fcs. From Figs. 7, 8, 9 and 10, we obtain directly: 


Fb1 = Fcs (1 - Pbr); (29)
Fb2 = Fcs Pbr Ph (1 - Pah); (30)
Fb3 = Fcs Pbr (1 - Ph) Pm (1 - Pam). (31)


After summation, we obtain Fb as: 


Fb = Ecs Fa ((1 - Pbr) + Pbr (Ph (1 - Pah)
+ (1 - Ph) Pm (1 - Pam))) = Eb Fa, (32)


where Eb is the B-Terminal Connection Efficiency, or shortly ‘B-Efficiency’. B-Efficiency
(Eb) is expressed as the ratio of the flow intensity occupying B-terminals (Fb)
to the intensity of the incoming flow (Fa). It is considerably different from the
Carried Switching Efficiency (Ecs):


Eb = Fb / Fa = Ecs ((1 - Pbr) + Pbr (Ph (1 - Pah) + (1 - Ph) Pm (1 - Pam))). (33)

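
A minimal Python sketch of (29)-(33) follows. The function names and the sample values are hypothetical and serve only to show how the three cases contribute to the B-Efficiency.

```python
def b_flow_shares(pbr: float, ph: float, pah: float, pm: float, pam: float):
    """Shares of Fcs that occupy B-terminals in Cases 1, 2, 3 (Eqs. 29-31)."""
    case1 = 1.0 - pbr                              # B-terminal found free
    case2 = pbr * ph * (1.0 - pah)                 # holding, not abandoned
    case3 = pbr * (1.0 - ph) * pm * (1.0 - pam)    # mailing, not abandoned
    return case1, case2, case3


def b_efficiency(ecs: float, pbr: float, ph: float, pah: float,
                 pm: float, pam: float) -> float:
    """Eq. (33): Eb = Fb / Fa = Ecs (share1 + share2 + share3)."""
    return ecs * sum(b_flow_shares(pbr, ph, pah, pm, pam))


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    eb = b_efficiency(ecs=0.8, pbr=0.5, ph=0.5, pah=0.2, pm=0.5, pam=0.4)
    print(f"Eb = {eb:.3f}")
```

When Pbr = 0 every call that reaches the B-terminal also occupies it, so Eb equals Ecs; any busy B-terminal makes Eb smaller than Ecs unless holding or mailing absorbs the call.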
Flow of Call Attempts, Occupying all B-Terminals 


Traffic intensity to B-terminals (Yb) is the sum of the traffic intensities (to them) in Cases 1, 2, and
3. From Figs. 8, 9 and 10 and Little’s theorem, we can obtain directly the following:




Yb = Yb1 + Yb2 + Yb3, (34)

where

Yb1 = Yar + Ycr + Yac + Ycc1 = Fb1 Tb1; (35)
Yb2 = Yac + Ycc2 = Fb2 Tb2; (36)
Yb3 = Ycm = Fb3 Tb3; (37)

and

Tb1 = Par Tar + (1 - Par)(Tcr + Pac Tac + (1 - Pac) Tcc1); (38)
Tb2 = Pac Tac + (1 - Pac) Tcc2; (39)
Tb3 = Tcm. (40)


Proposition 3. Traffic intensity to B-terminals (Yb) may be calculated from the equa- 
tion: 


Yb = Ecs Fa ((1 - Pbr) Tb1 + Pbr (Ph (1 - Pah) Tb2
+ (1 - Ph) Pm (1 - Pam) Tb3)), (41)


where Tb is the mean holding time of calls in B-terminals and Fb is the intensity of call 
attempts that occupy B-terminals. 


Proof: After summation of Yb1, Yb2 and Yb3, and taking into account expressions (2.8)— 
(2.10), we obtain: 


Yb = Fcs ((1 - Pbr) Tb1 + Pbr (Ph (1 - Pah) Tb2 + (1 - Ph) Pm (1 - Pam) Tb3)),

and after replacing Fcs with Ecs Fa from (9) we get (41).


Proposition 4. The mean holding time of all B-terminals (Tb), in accordance with cases 
1, 2, 3, is: 


Tb = Yb / Fb = ((1 - Pbr) Tb1 + Pbr (Ph (1 - Pah) Tb2 + (1 - Ph) Pm (1 - Pam) Tb3))
/ ((1 - Pbr) + Pbr (Ph (1 - Pah) + (1 - Ph) Pm (1 - Pam))). (42)


Proof: This follows directly from the formulas for Yb and Fb, and by directly applying
Little’s theorem.


Consequence: Traffic intensity of B-terminals (Yb) is: 


Yb = Fb Tb = Fa Eb Tb. (43) 
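
Equation (42) is a weighted mean of the three case holding times, with the flow shares of (29)-(31) as weights. The Python sketch below illustrates this and the consequence (43); all names and numerical values are assumptions made for the example.

```python
def mean_b_holding_time(pbr, ph, pah, pm, pam, tb1, tb2, tb3):
    """Eq. (42): Tb as the flow-weighted mean of Tb1, Tb2, Tb3."""
    w1 = 1.0 - pbr
    w2 = pbr * ph * (1.0 - pah)
    w3 = pbr * (1.0 - ph) * pm * (1.0 - pam)
    total = w1 + w2 + w3
    if total == 0.0:
        raise ValueError("no call occupies a B-terminal (Fb = 0); Tb is undefined")
    return (w1 * tb1 + w2 * tb2 + w3 * tb3) / total


def b_traffic(fa, eb, tb):
    """Eq. (43): Yb = Fb Tb = Fa Eb Tb (Little's theorem)."""
    return fa * eb * tb


if __name__ == "__main__":
    # Hypothetical values; times in seconds, Fa in calls per second.
    tb = mean_b_holding_time(pbr=0.5, ph=0.5, pah=0.2, pm=0.5, pam=0.4,
                             tb1=100.0, tb2=80.0, tb3=30.0)
    print(f"Tb = {tb:.2f} s, Yb = {b_traffic(2.0, 0.62, tb):.2f} Erl")
```

Because Tb is a weighted mean, it always lies between the smallest and the largest of Tb1, Tb2 and Tb3.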


3.6 A-Terminals’ Characteristics


In this subsection, analytical expressions characterizing all A-terminals (traffic intensity 
(Ya), intensity of occupation flow (Fa), holding time (Ta)) are obtained, as functions of 
known variables. 


Proposition 5. A-terminals’ traffic intensity (Ya) is: 


Ya = Ya0 + Ya1 + Ya2 + Ya3 = Fa Ta, (44)

where

Ya0 = Yed + Yad + Yid + Ycd + Ybs + Yis + Yns + Ycs; (45)
Ya1 = Yb1; (46)
Ya2 = Yah + Ych + Yb2; (47)
Ya3 = Yem + Yam + Yb3. (48)


Proof: Based on the proposed conceptual model and Figs. 7, 8, 9 and 10, and by
applying Little’s theorem, we can obtain the traffic intensity for each virtual device
in the a0 (Ya0), a1 (Ya1), a2 (Ya2), and a3 (Ya3) blocks, of the stages Dialing
(Yed, Yad, Yid, Ycd), Switching (Ybs, Yis, Yns, Ycs), Holding (Yah, Ych), and Mailing
(Yem, Yam), and by using the found traffic intensities of B-terminals (Yb1, Yb2, Yb3).
After summation, we obtain the following:


Ya = Fa Ta = Fa (Ted + Pad Tad + (1 - Pad)(Pid Tid + (1 - Pid)(Tcd + Pbs Tbs
+ (1 - Pbs)(Pis Tis + (1 - Pis)(Pns Tns + (1 - Pns)(Tcs + (1 - Pbr) Tb1 + Pbr (Tbr
+ Ph (Pah Tah + (1 - Pah)(Tch + Tb2))
+ (1 - Ph) Pm (Tem + Pam Tam + (1 - Pam) Tb3)))))))). (49)


Proposition 6. By distinguishing static and dynamic parameters, the mean holding time 
Ta of A-terminals is: 


Ta = S1 - S2 (1 - Pbs) Pbr - S3 Pbs - Ecs (Tb1 + Pbr (-Tb1
+ Ph (1 - Pah) Tb2 + (1 - Ph) Pm (1 - Pam) Tb3)), (50)


where S1, S2, and S3 are generalized static parameters:


S1 = Ted + Pad Tad + (1 - Pad)(Pid Tid + (1 - Pid)(Tcd + Pis Tis
+ (1 - Pis)(Pns Tns + (1 - Pns)(Tcs + 2 Tb1)))); (51)

S2 = (1 - Pad)(1 - Pid)(1 - Pis)(1 - Pns)(2 Tb1 - Tbr - Ph (Pah Tah
+ (1 - Pah)(Tch + 2 Tb2)) - (1 - Ph) Pm (Tem + Pam Tam + (1 - Pam) 2 Tb3)); (52)

S3 = (1 - Pad)(1 - Pid)(Pis Tis - Tbs + (1 - Pis)(Pns Tns + (1 - Pns)(Tcs + 2 Tb1))). (53)


Proof: Based on (49) in Proposition 5, obviously: 


Ta = Ted + Pad Tad + (1 - Pad)(Pid Tid + (1 - Pid)(Tcd + Pbs Tbs + (1 - Pbs)(Pis Tis
+ (1 - Pis)(Pns Tns + (1 - Pns)(Tcs + (1 - Pbr) Tb1 + Pbr (Tbr + Ph (Pah Tah
+ (1 - Pah)(Tch + Tb2)) + (1 - Ph) Pm (Tem + Pam Tam + (1 - Pam) Tb3))))))). (54)


After simple mathematical transformations we obtain (50). 


3.7 QoS Indicator 3: Overall Call Attempt Efficiency 


Definition 2.12 in [11] describes “successful call” as “A call that has reached the wanted 
number and allows the conversation to proceed”. Note that ‘call’ is “A generic term 
related to the establishment, utilization and release of a connection. Normally a qualifier 
is necessary to make clear the aspect being considered, e.g. call attempt.” [11]. A ‘call 
attempt’ is “An attempt to achieve a connection to one or more devices attached to a 
telecommunications network.” Therefore, a call may contain several call attempts.

Based on this, we define the Overall Call Attempt Efficiency (Ec) of a communication
service as the ratio of the flow intensity (Fc) of the call attempts with a fully and
successfully finished communication to the incoming call attempts’ intensity (Fa).

In the considered conceptual model, Fc is a sum of flow intensities of virtual devices 
cc1 (Case 1 in Fig. 8), cc2 (Case 2 in Fig. 9), and cm (Case 3 in Fig. 10):


Fc = Fcc1 + Fcc2 + Fcm. (55)


Then the Overall Call Attempt Efficiency (Ec) is:


Ec = Fc / Fa = Ecs ((1 - Pbr)(1 - Par)(1 - Pac) + Pbr (Ph (1 - Pah)(1 - Pac)
+ (1 - Ph) Pm (1 - Pam))). (56)


3.8 Network Generalized Subservice Indicators 


The Overall Call Attempt Efficiency (Ec) obviously includes the described indicators 
Carried Switching Efficiency (Ecs) and B-Terminal Connection Efficiency (Eb). From 
users’ and service providers’ point of view, it is important to distinguish the efficiency 
of the subservices of the telecommunication system. Such subservices include: 
switching (finding B-terminal), connection to B-terminal, finding B-user, transmission 
of messages (communication). Here we introduce specific QoS indicators for each of 
these subservices, as parts of the Overall Call Attempt Efficiency (Ec). 

As a QoS-specific indicator of the switching subservice (finding B-terminal), the 
Carried Switching Efficiency (Ecs), proposed in (9), could be used, i.e. as the ratio of 
the flow intensity of the calls reaching the intended B-terminal (Fcs) and receiving either 
a ‘busy tone’ or a ‘ringing tone’ signal, to the incoming call attempt intensity (Fa): 


Ecs = Fcs / Fa. (57)


Definition 1. A QoS-specific indicator (Qb) of the subservice ‘Connection to B-terminal’
is the ratio of the intensity of the flow seizing B-terminals (Fb) to the flow
intensity of all calls reaching the intended B-terminal (Fcs):


Qb = Fb / Fcs. (58)

Definition 2. A QoS-specific indicator (Qu) of the subservice ‘Finding B-user’ is the
ratio of the intensity of the flow seizing B-users (Fu) to the intensity of the flow seizing
B-terminals (Fb):


Qu = Fu / Fb. (59)


The intensity of the flow seizing B-users (Fu) is the sum of the intensities of the flows:
after ringing, Fb1 - (Far + Fcr), in Case 1 (c.f. Fig. 8); after holding, Fb2, in Case 2 (c.f.
Fig. 9); and of the carried mailing, Fb3:


Fu = Fb1 - Far - Fcr + Fb2 + Fb3. (60)


Definition 3. A QoS-specific indicator (Qc) of the communication subservice is the 
ratio of the flow intensity of call attempts with fully successfully finished communication 
(Fc) to the intensity of the flow seizing B-users (Fu): 


Qc = Fc / Fu. (61)

The proposed specific QoS indicators of telecommunication subservices are aggre- 
gated because: they aggregate many call attempts from many users and terminals (they 
are stochastic); some of them comprise several parallel services, e.g. Qc includes three 
successful cases — normal interactive communication, communication after call holding, 
and mailing. 
Considering the Overall Call Attempt Efficiency (Ec) as a composition of the four 
considered subservices, one may find that the quality metric is multiplicative: 


Ec = Fc / Fa = (Fcs / Fa)(Fb / Fcs)(Fu / Fb)(Fc / Fu) = Ecs Qb Qu Qc. (62)


This result allows more specific QoS analysis and more adequate QoS management. 
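
The multiplicative property in (62) is a telescoping product of the four ratios (57), (58), (59) and (61). A minimal Python sketch, with a hypothetical function name and hypothetical flow intensities, makes this explicit:

```python
def subservice_indicators(fa: float, fcs: float, fb: float, fu: float, fc: float):
    """Return (Ecs, Qb, Qu, Qc) from the flow intensities, Eqs. (57)-(61)."""
    if min(fa, fcs, fb, fu) <= 0.0:
        raise ValueError("Fa, Fcs, Fb and Fu must be positive")
    ecs = fcs / fa   # switching: calls reaching the intended B-terminal
    qb = fb / fcs    # connection to B-terminal
    qu = fu / fb     # finding B-user
    qc = fc / fu     # communication
    return ecs, qb, qu, qc


if __name__ == "__main__":
    # Hypothetical flow intensities (calls per unit time), for illustration only.
    fa, fcs, fb, fu, fc = 100.0, 80.0, 60.0, 45.0, 36.0
    ecs, qb, qu, qc = subservice_indicators(fa, fcs, fb, fu, fc)
    ec = ecs * qb * qu * qc
    print(f"Ecs={ecs:.2f} Qb={qb:.2f} Qu={qu:.2f} Qc={qc:.2f} Ec={ec:.2f}")
```

Since the intermediate flows cancel, a low overall efficiency Ec can be attributed to the subservice whose indicator is smallest.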


3.9 AB-Terminals’ Characteristics 


In this subsection, analytical expressions of characteristics of AB-terminals (all occupied 
calling terminals (A) and called terminals (B)) — i.e. traffic intensity (Yab), intensity of 
occupation flow (Fab), and holding time (Tab) — are obtained as functions of known 
variables. 

From the assumptions made and the conceptual model proposed in Subsect. 2.2, it
is clear that the intensity of the call flows occupying all terminals (Fab) is a sum of 
intensities of the call flows occupying A-terminals (Fa) and the call flows occupying B- 
terminals (Fb): 


Fab = Fa + Fb. (63)


The traffic intensity of all terminals (Yab) is a sum of traffic intensity of the A- (Ya) 
and B-terminals (Yb): 


Yab = Ya + Yb. (64) 


Proposition 7. The call flows intensity occupying all terminals (Fab) can be obtained 
by the following equation: 


Fab = Fa (1 + Ecs ((1 - Pbr) + Pbr (Ph (1 - Pah)
+ (1 - Ph) Pm (1 - Pam)))) = Fa (1 + Eb), (65)


where Ecs is the Carried Switching Efficiency (9) and Eb is the B-Efficiency (33).


Proof: It can be easily seen that (65) follows directly from (33) and (63).


Proposition 8. The traffic intensity of all terminals (Yab) can be presented by the 
following expression: 


Yab = Fa (Ta + Eb Tb). (66)


Proof: (66) follows directly from (43), i.e.:


Yab = Ya + Yb = Fa Ta + Fb Tb = Fa Ta + Fa Eb Tb = Fa (Ta + Eb Tb). (67)


Terminal Traffic Limitations. Since the number of terminals is limited to Nab 
(Assumption 2), and there is no negative occupancy, the following terminal traffic limi- 
tations obviously apply in the studied system: 


0 < Yab < Nab. (68) 


Proposition 9. Traffic of all simultaneously busy terminals (Yab), after separation of 
static parameters from dynamic parameters, may be expressed from Eqs. (50) and (66) 
as: 


Yab = Fa (S1 - S2 (1 - Pbs) Pbr - S3 Pbs), (69)


where S1, S2, and S3 are generalized static parameters as per (51), (52), and (53).


Proof: Based on (64), (49) and (41) we obtain: 


Yab = Fa (Ted + Pad Tad + (1 - Pad)(Pid Tid + (1 - Pid)(Tcd + Pbs Tbs
+ (1 - Pbs)(Pis Tis + (1 - Pis)(Pns Tns + (1 - Pns)(Tcs + (1 - Pbr) 2 Tb1 + Pbr (Tbr
+ Ph (Pah Tah + (1 - Pah)(Tch + 2 Tb2))
+ (1 - Ph) Pm (Tem + Pam Tam + (1 - Pam) 2 Tb3)))))))). (70)


After algebraic transformation and taking into account (51), (52), and (53), we obtain 
(69). 
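
The step from the nested expression (70) to the compact form (69) can be verified numerically. The Python sketch below is illustrative only: it reads (70) with the mailing branch (1 - Ph) Pm (Tem + ...) placed beside the holding branch inside the Pbr term, which is the grouping used in (52); the dictionary keys are the device abbreviations of the chapter, and the random test point is arbitrary.

```python
import random

P_KEYS = "ad id bs is ns br h ah m am".split()
T_KEYS = "ed ad id cd bs is ns cs br ah ch em am b1 b2 b3".split()


def _branches(P, T):
    hold = P["ah"] * T["ah"] + (1 - P["ah"]) * (T["ch"] + 2 * T["b2"])
    mail = T["em"] + P["am"] * T["am"] + (1 - P["am"]) * 2 * T["b3"]
    return hold, mail


def yab_per_fa_nested(P, T):
    """Yab / Fa following the nested structure of Eq. (70)."""
    hold, mail = _branches(P, T)
    busy = T["br"] + P["h"] * hold + (1 - P["h"]) * P["m"] * mail
    ring = T["cs"] + (1 - P["br"]) * 2 * T["b1"] + P["br"] * busy
    sw = P["is"] * T["is"] + (1 - P["is"]) * (P["ns"] * T["ns"] + (1 - P["ns"]) * ring)
    dial = T["cd"] + P["bs"] * T["bs"] + (1 - P["bs"]) * sw
    return T["ed"] + P["ad"] * T["ad"] + (1 - P["ad"]) * (
        P["id"] * T["id"] + (1 - P["id"]) * dial)


def yab_per_fa_static(P, T):
    """Yab / Fa from Eq. (69) with S1, S2, S3 as in (51)-(53)."""
    hold, mail = _branches(P, T)
    a = (1 - P["ad"]) * (1 - P["id"])
    x0 = P["is"] * T["is"] + (1 - P["is"]) * (
        P["ns"] * T["ns"] + (1 - P["ns"]) * (T["cs"] + 2 * T["b1"]))
    s1 = T["ed"] + P["ad"] * T["ad"] + (1 - P["ad"]) * (
        P["id"] * T["id"] + (1 - P["id"]) * (T["cd"] + x0))
    s2 = a * (1 - P["is"]) * (1 - P["ns"]) * (
        2 * T["b1"] - T["br"] - P["h"] * hold - (1 - P["h"]) * P["m"] * mail)
    s3 = a * (x0 - T["bs"])
    return s1 - s2 * (1 - P["bs"]) * P["br"] - s3 * P["bs"]


if __name__ == "__main__":
    rng = random.Random(7)
    P = {k: rng.random() for k in P_KEYS}          # arbitrary probabilities
    T = {k: 100 * rng.random() for k in T_KEYS}    # arbitrary mean times
    print(abs(yab_per_fa_nested(P, T) - yab_per_fa_static(P, T)) < 1e-9)
```

The factor 2 in front of Tb1, Tb2 and Tb3 reflects Assumption 9: while a B-terminal is occupied, the corresponding A-terminal is occupied for the same duration.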


Proposition 10. The mean occupation time (Tab) of all simultaneously busy terminals 
can be obtained from (70) as a function of Ta, Tb, and Eb. 


Proof: From the obvious formula Yab = Fab Tab, after replacing Yab with (66), Fab 
with (63), and Fb with (32), we have: 


Tab = Yab / Fab = (Ya + Yb) / (Fa + Fb) = (Fa Ta + Fb Tb) / (Fa + Fb)
= (Fa Ta + Fa Eb Tb) / (Fa + Fa Eb) = (Ta + Eb Tb) / (1 + Eb).


Proposition 11. The mean occupation time (Tab) of all simultaneously busy terminals
can be obtained from (71) as a function of S1, S2, S3, Pbr, Pbs, and Eb.


Proof: From the formula Yab = Fab Tab, and after replacing Yab with (69) and Fab
with (65), we obtain:


Tab = Yab / Fab = (S1 - S2 (1 - Pbs) Pbr - S3 Pbs) / (1 + Eb). (71)
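
As a small illustration of Propositions 10 and 11, the following Python sketch computes Tab both from the flows (Yab / Fab) and from the closed form (Ta + Eb Tb) / (1 + Eb). The function names and the numbers are hypothetical.

```python
def mean_ab_time_from_flows(fa: float, ta: float, fb: float, tb: float) -> float:
    """Tab = Yab / Fab = (Fa Ta + Fb Tb) / (Fa + Fb)."""
    return (fa * ta + fb * tb) / (fa + fb)


def mean_ab_time(ta: float, tb: float, eb: float) -> float:
    """Tab = (Ta + Eb Tb) / (1 + Eb), using Fb = Eb Fa from (32)."""
    return (ta + eb * tb) / (1.0 + eb)


if __name__ == "__main__":
    # Hypothetical values: Fa in calls/s, times in seconds.
    fa, ta, tb, eb = 10.0, 60.0, 40.0, 0.6
    fb = eb * fa
    print(mean_ab_time_from_flows(fa, ta, fb, tb))
    print(mean_ab_time(ta, tb, eb))
```

The closed form does not contain Fa, so the mean occupation time of a busy terminal depends on the incoming intensity only through the state-dependent quantities Ta, Tb and Eb.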


3.10 Offered Traffic to the Switching System 


Following the definition of equivalent traffic offered to the switching system, traffic 
(ofr.Ys) depends on the offered flow intensity (ofr.Fs) and the occupation (service) time 
Ts of an equivalent switching line: 


ofr.Ys = ofr.Fs Ts. (72) 


The offered flow to the switching system is the flow offered to the first Enter Switch 
device in Fig. 7. This device deflects calls, if there is no free line in the switching system, 
with probability of blocked switching (Pbs) to the Blocked Switching (bs) device, or 
with probability (1 — Pbs) of calls seizing free equivalent switching lines. So the offered 
flow intensity ofr.Fs is: 


ofr.Fs = Fa (1 - Pad)(1 - Pid). (73)


The occupation (service) time of an equivalent switching line (Ts) is determined by
the engaged devices of the switching system (c.f. Subsect. 2.2), namely the s device,
represented by a box with a dotted-dashed line inside the a0 box in Fig. 7, and three
other a-boxes (a1 in Fig. 8, a2 in Fig. 9, and a3 in Fig. 10). So consequently:


Ts = S1z - S2z Pbr, (74)


where:


S1z = Pis Tis + (1 - Pis)(Pns Tns + (1 - Pns)(Tcs + Tb1)); (75)

S2z = Tb1 - Tbr - Ph (Pah Tah + (1 - Pah)(Tch + Tb2))
- (1 - Ph) Pm (Tem + Pam Tam + (1 - Pam) Tb3). (76)


Probability of Blocked Switching 


Proposition 12. The probability of blocked switching (Pbs) could be obtained from 
(72) as: 


Pbs = Erl_b(Ns, ofr.Ys). (77) 


Proof: (77) simply expresses the use of the Erlang-B formula for determining
the blocking probability in the switching system, on the basis of the number of equivalent
internal switching lines (Ns) and the offered traffic ofr.Ys.
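
The chapter does not state how Erl_b is evaluated. One common and numerically stable option is the standard Erlang-B recursion, shown in the Python sketch below together with (72) and (73). The function names and the sample values are assumptions made for illustration.

```python
def erlang_b(n_lines: int, offered_traffic: float) -> float:
    """Erlang-B blocking probability via the recursion
    B(0) = 1, B(k) = A B(k-1) / (k + A B(k-1))."""
    if n_lines < 0 or offered_traffic < 0.0:
        raise ValueError("n_lines and offered_traffic must be non-negative")
    b = 1.0
    for k in range(1, n_lines + 1):
        b = offered_traffic * b / (k + offered_traffic * b)
    return b


def offered_switching_traffic(fa: float, pad: float, pid: float, ts: float) -> float:
    """Eqs. (72)-(73): ofr.Ys = ofr.Fs Ts with ofr.Fs = Fa (1 - Pad)(1 - Pid)."""
    return fa * (1.0 - pad) * (1.0 - pid) * ts


if __name__ == "__main__":
    # Hypothetical values: Fa in calls/s, Ts in seconds, Ns lines.
    ofr_ys = offered_switching_traffic(fa=2.0, pad=0.05, pid=0.05, ts=100.0)
    print(f"ofr.Ys = {ofr_ys:.1f} Erl, Pbs = {erlang_b(200, ofr_ys):.4f}")
```

The recursion avoids the large factorials of the closed-form Erlang-B expression, which matters for line counts such as Ns = 200. Note that in the model Ts depends on Pbr, so Pbs obtained this way is itself state dependent.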


Probability of Blocked Ringing (B-terminal Busy). Under Assumptions 4, 12, 14, the 
following expressions, presenting the probability of blocked ringing (Pbr) as a function 
of the network state Yab (traffic of all A- and B-terminals) and the number Nab of all 
active terminals in the system, could be obtained: 


Pbr = (Yab - 1) / (Nab - 1) if 1 ≤ Yab ≤ Nab,
Pbr = 0 if 0 ≤ Yab < 1. (78)

(78) was first proposed as part of the simple overall network teletraffic model, described 
in [24], and its proof was given in [25]. 
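
Reading (78) as Pbr = (Yab - 1) / (Nab - 1) for 1 ≤ Yab ≤ Nab and Pbr = 0 otherwise (the boundary inequalities are an assumption about the original typesetting), the expression can be sketched in Python as follows; the function name is hypothetical.

```python
def blocked_ringing_probability(yab: float, nab: int) -> float:
    """Eq. (78): probability of finding the intended B-terminal busy."""
    if nab < 2:
        raise ValueError("Nab must be at least 2 (Assumption 2)")
    if not 0.0 <= yab <= nab:
        raise ValueError("Yab must satisfy 0 <= Yab <= Nab, c.f. (68)")
    if yab < 1.0:
        return 0.0
    return (yab - 1.0) / (nab - 1.0)


if __name__ == "__main__":
    nab = 1000  # number of terminals used in the numerical results
    for yab in (0.5, 1.0, 250.0, 1000.0):
        print(f"Yab = {yab:7.1f}: Pbr = {blocked_ringing_probability(yab, nab):.4f}")
```

The subtraction of 1 in the numerator and denominator corresponds to Assumption 12: the calling terminal is busy but never calls itself, so it is excluded from both the busy terminals and the possible targets.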


4 Results 


4.1 QoS Indicator 4: Overall Traffic Efficiency Indicator 


Based on the “effective traffic” definition [11] as “The traffic corresponding only to the 
conversational portion of effective call attempts”, we define the Overall Traffic Effi- 
ciency Indicator (Ey) as a ratio of the effective traffic of A-terminals (Ycc) to the overall 
traffic of the A-terminals (Ya): 


Ey = Ycc / Ya, (79)


where


Ycc = Ycs + Ycc1 + Ycc2 + Ycm. (80)

The Overall Traffic Efficiency Indicator is used for simpler models in [7]. Some
authors use the name “Overall Traffic Efficiency” with a different meaning, and without any
definition, e.g. [26].


4.2 Numerical Results 


The input data considered is typical for voice communications in the Global System for 
Mobile communication (GSM). For simplicity we set M (defined in the explanations of 
Fig. 7) to 0. 

Figure 11 presents results (as functions of the state of the network load — the traffic 
of all AB-terminals Yab, in the theoretical interval [0, 100]) for a network with blocking 
probability due to insufficient resources. The number of all terminals (Nab) in the system 
is 1000 and the number of equivalent switching lines is Ns = 200 (i.e. 20% of Nab). 


[Figure 11: plot of the main output parameters (in %) against Network Load Yab/Nab (%), from 0 to 100, for Nab = 1000 terminals and Ns = 200 switching lines.]


Fig. 11. The values of the main output parameters of the model of an overall network with limited 
capacity. 


The probability of finding the B-terminal busy (Pbr), not shown in Fig. 11, increases
linearly with the network traffic load, cf. (78), to almost 1. The numerical results
demonstrate the existence of a local maximum for the probability of blocked switching
Pbs. This is because the overall blocking probability in the network, which includes
both Pbr and Pbs, cannot exceed 1, cf. Fig. 12.


[Figure 12: cumulative diagram.]
Fig. 12. The causal aggregated overall pie probabilities


4.3 Overall Pie Parameters 


In the model considered (cf. Figs. 7, 8, 9 and 10), there are five reasons for call attempt
ending: abandoning (6 branches), interruption (1 branch), blocking (2 branches),
unavailable service (1 branch), and successful communication (3 branches). By
describing the effect caused by each reason, one can construct a ‘causal branch’ for it.
The causal branch comprises all basic virtual devices involved in the call attempt ending
due to the considered reason, which form the corresponding causal complex virtual device
with its flow-, time-, and traffic characteristics. Overall, there are 13 causal branches
in the model.

The three branches of successful communication have the following service times 
Tp.cc1, Tp.cc2, and Tp.cm:


Tp.cc1 = Ted + Tcd + Tcs + Ter + Tcc1; (81)
Tp.cc2 = Ted + Tcd + Tcs + Tch + Tcc2; (82)
Tp.cm = Ted + Tcd + Tcs + Tem + Tem. (83)


The pie flow intensity of each of the three subcases of successful communication coincides
with the flow intensity of the last virtual device in the respective causal branch:
Fp.cc1 = Fcc1; Fp.cc2 = Fcc2; and Fp.cm = Fcm.


The pie flow probabilities of the three branches of successful communication respec- 
tively are: 

Pp.cc1 = Fp.cc1 / Fa; Pp.cc2 = Fp.cc2 / Fa; Pp.cm = Fp.cm / Fa. (84)


The pie traffic intensities of the three branches of successful communication are: 


Yp.cc1 = Fp.cc1 Tp.cc1 / Ya; Yp.cc2 = Fp.cc2 Tp.cc2 / Ya; Yp.cm = Fp.cm Tp.cm / Ya. (85)


By analogy, one may easily obtain all other overall pie probabilities, pie flows, and
pie traffic intensities in the model, by using the normalized parameters found in Sect. 3.


4.4 Causal Aggregated Overall Pie Parameters 


The overall causal branches may be aggregated as needed for telecommunication
system monitoring, design, or management. A useful aggregation is the causal
aggregation of all the branches corresponding to one type of call attempt ending.

For instance, for the case of successful communication, one can express the aggre- 
gated parameters of the branches of the Aggregated Overall Successful Carried Commu- 
nication Branch, considered as a complex virtual device p.c. The metrics are additive 
because this is a pie presentation of the model. 

The causal aggregated overall pie probability of a call attempt ending with successful 
communication (Pp.c) is: 


Pp.c = Pp.cc1 + Pp.cc2 + Pp.cm. (86)


By taking into account (56), the overall causal pie flow intensity of successful 
communication (Fp.c) respectively is: 


Fp.c = Fp.cc1 + Fp.cc2 + Fp.cm = ... = Ec. (87)


The overall causal pie traffic intensity of successful communication (Yp.c) is:


Yp.c = Yp.cc1 + Yp.cc2 + Yp.cm. (88)


Similarly, one may find all other causal aggregated overall pie parameters of the
model.
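
The additivity of the pie parameters can be illustrated with a short Python sketch. It computes the pie probability (Fp / Fa) and the pie traffic intensity (Fp Tp / Ya) of each branch and then sums them over the three branches of successful communication. The class and function names and all numerical values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class CausalBranch:
    name: str            # branch label, e.g. "cc1"
    flow: float          # pie flow intensity Fp of the branch
    service_time: float  # pie service time Tp of the branch


def pie_parameters(branch: CausalBranch, fa: float, ya: float) -> tuple[float, float]:
    """Return (pie probability, pie traffic intensity) of one causal branch."""
    return branch.flow / fa, branch.flow * branch.service_time / ya


def aggregate(branches: list[CausalBranch], fa: float, ya: float) -> tuple[float, float]:
    """Sum the pie parameters over all branches of one type of call attempt ending."""
    pairs = [pie_parameters(b, fa, ya) for b in branches]
    return sum(p for p, _ in pairs), sum(y for _, y in pairs)


# Illustrative values only (not taken from the chapter).
successful = [
    CausalBranch("cc1", flow=2.0, service_time=100.0),
    CausalBranch("cc2", flow=1.0, service_time=50.0),
    CausalBranch("cm", flow=1.0, service_time=50.0),
]
pp_c, yp_c = aggregate(successful, fa=10.0, ya=1000.0)
```

With these numbers the aggregated pie probability is 0.2 + 0.1 + 0.1 and the aggregated pie traffic intensity is 0.2 + 0.05 + 0.05.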


4.5 Numerical Results for Pie Characteristics


Figures 12 and 13 present numerical results for the causal overall pie probabilities and 
traffic intensities for each of the five reasons for call attempt ending (i.e. abandoning p.a,
interrupting p.i, blocking p.b, service not available p.n, and successful communication
p.c) as functions of the network traffic load. 


[Figure 13: cumulative diagram.]
Fig. 13. The causal aggregated overall pie traffic intensities


The overall pie characteristics and their causal aggregations may be considered and
used as causal-oriented QoS indicators. They allow a more precise estimation of the
dynamic importance of each reason for call attempt ending and thus a more precise
dynamic targeting of QoS management efforts.


5 Conclusion 


The presented modeling approach and the corresponding numerical results demonstrate the
considerable potential and importance of overall teletraffic models of telecommunication
systems with QoS guarantees.

Such models allow prediction of many overall QoS indicators as regards the flow-, 
time-, and traffic characteristics of the A-, B-, and AB-terminals and users, as well as of 
the overall network performance. 

The approach makes it easy to separate an overall telecommunication service
into different subservices, with specific QoS indicators for each of them.

In this chapter, the newly proposed indicators are network-oriented or terminal- 
oriented. The model, however, is suitable for the development of user-oriented indicators 
as well. This will be a task for future research. 

Applying pie characteristics and their causal aggregations to the subservices results
in causal-oriented QoS indicators. This allows a more precise estimation of the dynamic
importance of each reason for call attempt ending in every subservice, and thus a more
precise dynamic targeting of QoS management efforts. Applying a similar approach
(with specific QoS indicators) to multimedia and multiservice networks seems very
attractive and promising.

Another important goal could be the development of methods for using specific QoS
indicators as sources for predicting QoE indicators.


Abstract. With the emerging IoT and Cloud-based networked systems
that rely heavily on virtualization technologies, elasticity becomes a dom- 
inant system engineering attribute for providing QoS-aware services to 
their users. Although the concept of elasticity can introduce significant 
QoS and cost benefits, its implementation in real systems is full of challenges.
Indeed, today's systems are mainly distributed, built upon several
layers of abstraction, and equipped with centralized control mechanisms. In
such a complex environment, controlling elasticity in a centralized man- 
ner might strongly penalize scalability. To overcome this issue, we can 
conveniently split the system into autonomous subsystems that implement
elasticity mechanisms and run control policies in a decentralized manner. 
To efficiently and effectively cooperate with each other, the subsystems 
need to communicate among themselves to determine elasticity deci- 
sions that collectively improve the overall system performance. This new 
architecture calls for the development of new mechanisms and efficient 
policies. In this chapter, we focus on elasticity in IoT and Cloud-based 
systems, which can be geo-distributed also at the edge of the networks, 
and discuss its engineering perspectives along with various coordination 
mechanisms. We focus on the design choices that may affect the elas- 
ticity properties and provide an overview of some decentralized design 
patterns related to the coordination of elasticity decisions. 


1 Introduction 


Elasticity is a quality attribute that is widely used in virtual environments 
together with the “as a service” paradigm to deal with on-demand changes. 
Although elasticity is multi-dimensional [27,72], in most cases, elasticity tech- 
niques just focus on offering elastic resources on demand and dynamically provi- 
sion them to fluctuating workload needs based on the “pay-per-use” concept [23]. 
In this sense, elasticity mechanisms automatize the process of reconfiguring virtualized
resources, mostly at infrastructural levels, at runtime with the goal of
sustaining offered Quality of Service (QoS) levels and optimizing resource cost. 

Due to its usefulness, there are many works that have addressed issues 
related to elasticity [23]. However, most of them discuss elasticity in specific 
environments, such as Cloud systems in centralized, large-scale data centers 
(e.g., [46]), edge/fog-based systems (e.g., [54]), network function virtualization 
(NFV) (e.g., [67]); only a few works consider Internet of Things (IoT)
Cloud systems, e.g., [70]. In this chapter, we investigate how distributed systems 
can be efficiently executed in the emerging context resulting from the conver- 
gence of IoT, NFV, edge systems, and Clouds. More precisely, our goal is to 
survey elasticity needs, mechanisms, and policies for geo-distributed systems running
over multiple edge/fog¹ and Cloud infrastructures. Furthermore, we present
several design patterns that help to efficiently decentralize and coordinate the 
elasticity control of such systems. The main contributions of this chapter are the 
following: 


— We present how the emerging computing paradigms and technologies help to 
realize elastic systems, which can execute with guaranteed QoS even in the face
of changing operating conditions.

— We survey the key elasticity properties and techniques that have been pre- 
sented so far in the related literature. Specifically, we survey the approaches 
that enable elasticity at different stages of the system lifetime, distinguishing
between design-time and runtime. 

— Motivated by the scalability limitation of distributed complex systems, we 
propose different coordination patterns for decentralized elasticity control. 
The latter represent architectural design guidelines that help to oversee large-scale
systems with the aim of improving performance and reliability without
compromising scalability. 

— We describe the main challenges of today's systems so as to identify research
directions that are worth investigating, in order to develop seamlessly elastic
systems that can operate over geo-distributed and Cloud-supported edge
environments.


The rest of the chapter is organized as follows. In Sect. 2 we provide an
overview of elasticity. In Sect. 3 we briefly present the large-scale distributed
systems we focus on in this chapter, that is, systems of IoT, NFV, and Clouds,
and discuss their elasticity coordination needs. In Sect. 4 we provide an overview
of optimization approaches used to take elasticity choices. In Sect. 5 we present 
some design patterns that can be used in distributed edge Cloud environments 
to coordinate elasticity decisions in a decentralized fashion. In Sect. 6 we discuss 
some research challenges for elasticity control. We conclude the chapter in Sect. 7 
with some final remarks. 


¹ In this chapter, we consider edge computing as interchangeable
with fog computing, although we are aware that some differences exist [53].


2 Overview of Elasticity


Elasticity has become one of the key attributes of self-adaptive software systems. 
Although it has been, and still is, widely investigated, there is no consensus
on its definition. The most frequently used definition of elasticity
has been formulated by Herbst et al. [36] as follows: “Elasticity is the degree to 
which a system is able to adapt to workload changes by provisioning and depro- 
visioning resources in an autonomic manner, such that at each point in time the 
available resources match the current demand as closely as possible”. The elastic- 
ity quality attribute is tightly related to the scalability and efficiency attributes. 
Scalability addresses a typically static system attribute related to the ability 
of a system to adjust its resources to changing load. However, volatile software 
environments demand a continuous adaptation process [78], which yields considerable
additional costs if applied manually. Another closely related quality
attribute is efficiency, which concerns the amount of resources consumed
to process the traffic. Traditionally, these terms were related to a static system
configuration and not considered in terms of dynamic system architecture
models.

With the emergence of virtualization technologies, especially lightweight ones 
such as containers [12] and unikernels [13], there are new automation possibili- 
ties no longer related to the physical scaling of system resources, but rather to 
the dynamic adaptation of the system to deal with changing environment condi- 
tions. System/application components can scale out according to traffic needs to 
accommodate changes in the traffic volumes and avoid SLA violations, and can 
scale in to save energy and costs caused by over-dimensioning. Virtualization
technologies have opened new possibilities for system automation and for implementing
the elasticity attribute in dynamic systems. However, when implemented
in real systems, the beneficial effects of elasticity can be limited mainly by the 
speed of the system adaptation process and by the precision in aligning the allocated
virtual resources to the temporal resource demands. Therefore, dynamic
adaptation models must also consider the limited ability of real systems to adapt
in a timely and precise manner.
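
To make the scale-out and scale-in behavior described above concrete, the following Python sketch shows one simple reactive sizing rule. It is an illustration under our own assumptions, not a policy proposed in this chapter: the function name, the target utilization of 0.7, and the replica bounds are all hypothetical.

```python
import math


def desired_replicas(load: float, capacity_per_replica: float,
                     target_utilization: float = 0.7,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Number of replicas that keeps utilization at or below the target.

    load and capacity_per_replica use the same unit (e.g. requests/s).
    """
    if capacity_per_replica <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and target in (0, 1]")
    needed = math.ceil(load / (capacity_per_replica * target_utilization))
    # Clamp: never scale in below the minimum or out beyond the maximum.
    return max(min_replicas, min(max_replicas, needed))
```

A rule of this kind scales out when the load grows and scales in when it drops. It does not model the adaptation delay or the granularity of real resources, which are exactly the limitations noted in the paragraph above.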

The main aim of dynamic adaptation models is to exploit optimization algorithms
that guide elasticity decisions at runtime as traffic changes, so as to achieve the best QoS
and cost gains while considering a large combinatorial set of architectural design
options that is no longer manageable by human designers [77]. Optimization
solutions can be categorized according to several key aspects [4]:


— Which software attributes are to be optimized? Every software attribute for 
which a quantifiable model can be provided is a candidate to
be used in the quality evaluation function. Quality attributes also include 
economical attributes, such as associated costs [10], among which operational 
infrastructure costs prevail in the Cloud era [27]. According to the selected 
quality attributes, optimization approaches can be single- or multi-objective. 

— What design choices are considered under optimization? In order to pro- 
vide an automatized optimization process, a machine-readable format of the 
software architecture is required. These can range from formal models and UML
models to models in architecture description languages (ADLs). On top
of that model, there must also exist an unambiguous definition of which combinatorial,
categorical, or ordinal variables are to be considered in forming the
optimization search space. These definitions may also yield additional design
constraints, which exclude some of the combinations due to architectural
constraints (e.g., applying a certain architectural style). In the literature, these variables
are referred to as architectural degrees of freedom (DoF) [42].

— In which phase does optimization take place? According to this dimension, 
solutions vary from design-time optimization to runtime optimization methods.
In design-time approaches, the system is first modeled in the desired
language, and optimization is then performed on the derived models according to
specific quality attributes. These models can include block diagrams, Markov chains,
queuing networks, and Petri nets, with quality attributes predicted by using
computer simulation or, when available in closed form, analytical models.
Runtime approaches are generally simpler due to stringent execution
speed and overhead constraints, so they often consider optimizing only a 
single attribute or they naively combine several attributes using the simple 
additive weighting (SAW) method [39]. 
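
As an illustration of how a runtime approach can combine several attributes, the following Python sketch implements a basic form of simple additive weighting (SAW): each attribute is normalized against the best value among the alternatives and the normalized values are combined with fixed weights. The function name, the example configurations, and the weights are hypothetical, and all attribute values are assumed to be strictly positive.

```python
def saw_rank(alternatives: dict[str, dict[str, float]],
             weights: dict[str, float],
             is_benefit: dict[str, bool]) -> list[tuple[str, float]]:
    """Rank alternatives by their simple-additive-weighting score, best first."""
    scores = {name: 0.0 for name in alternatives}
    for attr, weight in weights.items():
        values = [attrs[attr] for attrs in alternatives.values()]
        best = max(values) if is_benefit[attr] else min(values)
        for name, attrs in alternatives.items():
            # Benefit attributes: higher is better. Cost attributes: lower is better.
            normalized = attrs[attr] / best if is_benefit[attr] else best / attrs[attr]
            scores[name] += weight * normalized
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical configurations described by throughput and cost.
configs = {
    "A": {"throughput": 100.0, "cost": 10.0},
    "B": {"throughput": 50.0, "cost": 5.0},
}
ranking = saw_rank(configs,
                   weights={"throughput": 0.8, "cost": 0.2},
                   is_benefit={"throughput": True, "cost": False})
```

Here configuration A scores 0.8 · 1 + 0.2 · 0.5 = 0.9 and B scores 0.8 · 0.5 + 0.2 · 1 = 0.6, so A is ranked first. The result depends entirely on the chosen weights, which is why the method is considered naive.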


A thorough literature review of existing optimization methods used in soft- 
ware architectures was performed in [4]; therefore, we analyze only research 
works that have been conducted afterwards. We focus on emerging systems of 
IoT, virtual network functions, and distributed Clouds. We also give special 
attention to optimization in the domain of distributed system environments and 
classify existing works according to the phase of execution. Furthermore, we con- 
sider decentralized coordination design patterns that can be employed to realize 
a distributed elasticity control where elasticity decisions have to be taken at 
multiple layers. 


3 Systems of IoT, NFV and Clouds and Their Elasticity 
Coordination Needs 


Research works related to elastic architectures and applications span multiple
areas, ranging from embedded systems and information systems design to
software performance engineering and quality attributes [4]. A general observation
from all involved research communities is that system complexity generally
increases, which makes systems hard to manage and scale and expensive to maintain
and change. A general trend is to define new system architectural models that
decompose complex system architectures into smaller and easily self-manageable 
objects, e.g., microservices [26]. These new system architectures are based on vir- 
tualization and automatic software provisioning and configuration technologies 
to enable dynamic system models that can autonomously adapt to face varying 
operating conditions. 

Emerging systems and services are and will be characterized by the integration
and convergence of different paradigms and technologies, spanning
IoT, virtual network functions, distributed edge/fog computing, and Cloud computing
[73]. We briefly review the main features of some of these paradigms and
technologies before analyzing their coordination needs.

NFV is a new network architecture framework where network functions, 
which traditionally used dedicated hardware (e.g., network appliances), are 
implemented in software that runs on top of general purpose hardware, exploit- 
ing virtualization technologies. Virtual network functions can be interconnected 
into simple service compositions (called chains) to create more complex com- 
munication services. Component network functions in the service chain can be 
scaled either vertically or horizontally (i.e., either acquiring more powerful com- 
puting resources or spawning more replicas of the same virtual network function 
and load balancing among them). 

Edge and fog computing paradigms provide a distributed computing and stor- 
age infrastructure located at the edges of the network, resulting in low latency 
access and faster response times to application requests. These paradigms turn 
out to be particularly effective in moving computation and storage capabilities 
closer to data production sources (e.g., IoT devices) and data consumption des- 
tinations, which are heavily dispersed and typically located at the edges of the 
network. Therefore, they can meet the requirements of IoT applications
better than a conventional Cloud [64].

Dealing with elasticity for such emerging systems is important and challenging.
However, elasticity techniques that have been studied separately for
virtualized systems, mainly in large-scale and centralized Cloud data centers or,
less frequently, in distributed edge/fog environments, may not be sufficient to
efficiently manage the more complex environments that arise from the convergence
of IoT, NFV and Clouds. Figure 1 outlines the conceptual view of such virtualized
systems, built atop various views on IoT Cloud [44,70]. With such systems,
it is crucial to have end-to-end elasticity [71], which requires strong elasticity
coordination between IoT, NFV and Clouds. For example, let us consider
how elasticity coordination would help to best prepare the Cloud to serve data
from the edge. Currently, most of the time, the Cloud does not really take the
edge into account: if more data arrive, the Cloud reacts and provisions more resources.
However, if the elasticity demands from the edge were known and propagated
to the Cloud in advance, the Cloud could provision resources
more effectively. This is possible when we control both
sides, edge and Cloud. On the one hand, end-to-end elasticity requires us to
work horizontally across IoT, NFV, and Cloud. On the other hand, each system
might have different layers, as shown in Fig. 1 and discussed in [65]. Therefore,
it is crucial to coordinate elasticity both horizontally and vertically, across layers
and across subsystems. This leads to our focus on models and techniques to
control and manage elasticity.

The following questions about key elasticity properties and techniques are
crucial to understand:


— Which types of elasticity properties are suitable for which layers (resources,
data, service, network)?

— Which elasticity control techniques are suitable for which parts (edge, network,
or data center) and which models are useful for coordinating them?

— How to connect elasticity coordination between the software engineering view and
the system engineering view?


Fig. 1. System of IoT, virtual network functions and Cloud (adapted from [44]).


4 Existing Solutions — Pros and Cons 


4.1 Software Attributes and Design Choices 


For a successful software optimization, it is important to select appropriate software
attributes that reflect the users' perception of quality. The most prominent
software attribute is performance, as it is the subject of most optimization
techniques. Performance expresses the timing behavior along different computation
paths. There are many metrics that express software performance, the most
important being response time, throughput, and utilization [40].

Another commonly optimized attribute is reliability: the ability of the system
to correctly provide a desired functionality during a period of time in a given
context. A term closely related to reliability is availability: the percentage
of time a system is up and running to serve requests [10]. Both terms are
covered by the dependability attribute: the overall probability that the system will produce
the desired output under specified performance, and thus the overall user confidence that
the system will not fail in normal operation.

System costs can also be considered as a business quality attribute [10].
They can be divided into design-time costs (development, licensing, hardware
acquisition, and maintenance costs) and runtime costs (operational
infrastructure costs and energy costs).

Design choices that are considered in the optimization process should not alter
any functionality of the end system, but affect only its quality attributes. Choices
can be software related or deployment related [42] and are categorized in Table 1.


4.2 Design-Time Approaches 


Historically, design-time optimization solutions were oriented to embedded systems
because of their stringent requirements on extra-functional properties (EFPs).
For that purpose, the ArcheOpteryx tool [3], an Eclipse plug-in, implements
AADL specifications [30] and employs multi-objective evolutionary heuristics for
approximating the optimal solution of embedded component-based systems. Specifically,
ArcheOpteryx optimizes the communication cost between components in two
ways: it optimizes data transmission reliability, formed around the total frequency
of component interactions against network connection reliability, and communication
overhead due to limited network bandwidth and delays. Another representative
solution from the automotive domain is the EAST-ADL language [74],
inspired by the MARTE modeling language [59]. EAST-ADL also employs genetic
algorithms (GAs) with the multi-objective selection procedure NSGA-II [25], which is
quite common in multi-objective approaches. Quality is evaluated using fault-tree
models for safety analysis, and the MAST analysis tool is used to derive mean
system response times. Component life-cycle cost is also one of the objectives.
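
Multi-objective selection procedures such as NSGA-II build on the notion of Pareto dominance. The following Python sketch shows only that underlying notion, not NSGA-II itself: it filters a set of candidate configurations down to the non-dominated ones, assuming that every objective is to be minimized. The function names and the sample points (response time, cost) are hypothetical.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (response time, cost) pairs for four configurations.
candidates = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
front = pareto_front(candidates)
```

In this example the configuration (3.0, 3.0) is dropped because (2.0, 2.0) is better in both objectives. The remaining three represent different trade-offs, and choosing among them is left to the designer.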

Recently, the focus of design-time optimization has shifted towards information
systems, as systems became more complex and, at the same time, more reliable,
with stricter EFP requirements regulated through service-level agreements
(SLAs). The majority of research works employ search heuristics through various
multi-objective evolutionary algorithms. Li et al. [45] applied a model-based
methodology to size and plan enterprise applications under SLAs, considering
response time and cost as optimization quality attributes. They modeled a multi-tier
enterprise application with a closed queuing network model and applied an
evolutionary algorithm to evaluate different configurations. They parametrized
the queuing network models by measuring the real system and assumed exponential
arrival and service times. Mean Value Analysis was used to obtain the response
time in a stationary state. A similar approach was also employed in [60], where
multi-objective evolutionary algorithms were used to optimize the performance
and reliability of systems expressed through AADL models. Menascé et al. [48]
proposed to optimize the performance and availability of service-based information
systems by applying a hill-climbing optimization method. The overall system is
represented as a service-activity model, which models the execution sequence of
different services. The PerOpteryx tool [43] applied the Palladio Component Model
(PCM) [11] for predicting the performance of various architecture configurations
of component-based information applications. The optimized attributes also included
system performance and cost. An industrial case study of the PerOpteryx tool was
conducted in [32]. The underlying PCM model is automatically transformed to
Layered Queue Models (LQM) [66], with predicted values obtained using
simulation. PerOpteryx also applies a multi-objective genetic algorithm with the NSGA-II

Table 1. Possible software and deployment design choices


Software design choices

Selection of components: Wherever functionality is encapsulated within interchangeable components, as in component-based or service-based architectures, a set of compatible components can be expressed with different quality properties. In component-based systems such a selection is often available only at design time, while service-based systems enable services to be selected at runtime.

Component configuration parameters: Components often provide further configuration parameters that affect their delivered quality. This is especially the case in component-based architectures. For example, in a component that processes and compresses input data, the compression ratio can be altered to balance output quality against processing performance. Parameters can also be non-numerical, like the selection of compression algorithms or of supported encryption algorithms in SSL communication. Such parameters also include the multiplicity of logical resources, like limits on the allowed number of threads or database connections, or state the priorities for certain actions in concurrent processing scenarios. These all affect the overall delivered component quality and thus can be subject to optimization.


Deployment design choices

Allocation: Allocation is defined by a mapping from software components to available hardware resources. Each component can be allowed on only a single resource or deployed across several resources. Components can have certain allocation constraints that need to be satisfied, such as a minimal amount of RAM required. Distributed systems are very sensitive to allocation, as it affects quality attributes like response time, throughput, reliability, and availability of the system. Performance is affected by the communication overheads between components allocated on different servers, whereas reliability suffers if components are deployed on the same servers, which requires a careful balance.

Replication: The replication design choice states the number of deployed component instances required. Replication affects reliability and overall performance. When component replication is present, additional components are required, like load balancers for balancing the workload between several components or switches that route traffic from primary components to fail-over components in passive replication scenarios. Replication design freedom is the key runtime parameter in elastic systems, as it is altered to continually adapt the maximal component processing capacity to current workload requirements.

Resource selection: When allocating software components to hardware resources, a number of different configuration options are present: selecting appropriate disk storage, type of CPU/GPU, etc. In embedded systems these are predetermined at design time, but for elastic information systems they can also be varied at runtime in the reconfiguration process. Resource selection primarily affects cost and performance attributes but can also affect dependability attributes. Resource selection can be achieved at different granularity levels. Sometimes selection refers to individual hardware components, but more often it refers to selecting pre-configured available resource types, like selecting a virtual machine type from a Cloud provider. In the case of selecting whole servers, resource selection can also cover the provided software packages, like the OS, pre-installed tools and platforms, etc.

Resource parameters: Selected resources, both hardware and software, can have many tunable parameters that can be altered at selection/installation time, or sometimes even at runtime. At selection, resources can be chosen based on different parameters (e.g., CPU clock rate, number of cores, amount of RAM), and during installation different platform parameters can be altered (e.g., virtual memory available, TCP stack parameters, JVM configuration). If supported, some parameters can also be altered during runtime.

Resource provider: When selecting resources, different competing providers can be chosen. Differences lie in hardware offers, pricing amount, pricing model options, and offered SLAs. The greatest benefit from choosing diverse resource providers is an increase in system reliability and the prevention of vendor lock-in.

Resource location: In the era of IoT, edge computing, and latency-critical applications, resource location is also an important factor to optimize. Data center location, whether a Cloud data center or a micro edge/fog data center, has a large impact on network latencies, especially in distributed mobile systems.


selection method. By employing simulation, a more sophisticated set of measures, 
such as percentiles, which are often agreed in SLAs, can be obtained. A 
faster evaluation method that can also predict performance measures beyond 
mean values is fluid analysis [69]. Pérez and Casale [57] suggested a method for 
deriving fluid models from LQN models obtained from PCM models. Fluid models 
are described by a set of ordinary differential equations that approximate the 
evolution of Markovian models, in their case closed class queue networks. Malek 
et al. [47] proposed a method for optimizing availability, communication security, 
latency, and energy consumption, which are influenced by the various deployments 
of a distributed information system. They applied both Mixed-Integer 
Nonlinear Programming (MINLP) algorithms and genetic algorithms to solve 
the derived optimization problems. They also provided guidelines on strengths 
and weaknesses of both approaches. There is also a semi-automated approach 
that employs formalized expert knowledge to suggest different solutions 
to recurrent problems, such as performance bottlenecks, as presented in [7]. In [8] 
anti-patterns are mitigated using a fuzzy approach so that each anti-pattern is 
labeled with a probability of occurrence. Similar efforts tailored for Cloud environments 
have also been proposed [62]. Perez-Palacin et al. [58] suggested a 
framework for analyzing trade-offs between system cost and adaptability. They 
modeled service adaptability through several metrics based on the number of 
used components for providing a given service and the total number of compo- 
nents offering such service. 

There are also recent solutions that are specialized for dealing with dynami- 
cally used logical resources such as elastic Cloud infrastructure. These solutions 
must take into account the dynamics of used resources over time, which was not 
supported in the approaches mentioned above. The SPACE4CLOUD project [31] 
resulted in a design-time tool for predicting the costs and performance of a given 
Cloud information system topology expressed in PCM. In order to enable fully 
automated search over design space, the SPACE4CLOUD tool was combined 
with PerOpteryx evolutionary heuristics in a separate study [20]. Evangelinou 
et al. [19,29] further developed such a tool to provide a methodology for migrat- 
ing existing enterprise applications to Cloud by selecting an optimal deploy- 
ment topology that takes topology cost and performance into account. To enable 
a faster search, initial solutions for the evolutionary algorithm are provided by a 
Mixed-Integer Linear Programming (MILP) algorithm. Evolutionary algorithms 
are supplemented with local search heuristics. As before, the application topology 
in SPACE4CLOUD is optimized for a specific workload intensity, typically 
at peak. Andrikopoulos et al. [6] employed a graph-based model to represent 
a Cloud application topology with a complementary method for selecting the 
best topologies based only on operational infrastructure cost provided by simple 
analytical models. 


4.3 Runtime Approaches 


In contrast to design-time approaches, runtime approaches continually vary 
the chosen architecture DoFs in order to adapt to volatile environments while 
keeping the desired application attributes optimal. Runtime optimization is primarily 
focused on, but not limited to, availability, performance, and cost quality 
attributes and is considered the key characteristic of self-adaptive systems [24]. 
Since algorithms are running online at all times, they are forced to apply simpler 
but very fast analytical models like simple aggregation functions (summation, 
maximal and average values) or analytical models of M/M/1 queues. Research 
efforts have been mostly oriented towards service-based [52] and Cloud systems. 
Calinescu et al. [14] systematized a majority of runtime optimization research 
involved in service-based systems, and based their approach around Discrete 
Time Markov-Chain models. They provided a means to formally specify QoS 
requirements, model-based QoS evaluation, and a MAPE-K cycle [38] for adap- 
tation. Passacantando et al. [56] formulated runtime management of IaaS infras- 
tructure from a SaaS Cloud provider viewpoint as a Generalized Nash Equi- 
librium Problem (GNEP). SaaS providers strive to minimize the costs of used 
resources, while IaaS providers in parallel aim to maximize profits. From a performance 
perspective, services are modeled as simple M/G/1 queues. A distributed 
algorithm based on best-reply dynamics is used to compute the equilibrium periodically. 
Gomez Saez et al. [61] provided a conceptual framework for achieving 
an optimal distribution of an application that involves both runtime and design-time 
processes. Nanda et al. [51] formulated the optimization problem for minimizing 
the SLA penalty and dynamic resource provisioning cost. Their model defined 
only a single DoF, expressed as the number of virtual machines assigned to each 
application tier. Grieco et al. [33] proposed an algorithm for the continuous 
redeployment of multi-tier Cloud applications due to system evolution. They 
proposed an adaptation graph aimed to find the best composition of adaptation 
processes satisfying a goal generated at runtime. Goals are defined as transi- 
tions from original to destination state. Recently, the SPACE4CLOUD tool was 
extended to provide optimal runtime scaling decisions limited to replication DoF 
[34], while Moldovan et al. [49] provided a cost model for resource replication 
that is more aligned with public Cloud offerings. 
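
As a minimal sketch of the kind of fast analytical model mentioned above, the following snippet sizes a single application tier, i.e., it treats the number of virtual machines as the only DoF. It assumes that the workload is split evenly over identical virtual machines, each behaving as an M/M/1 queue with mean response time 1/(mu - lambda). The function names, parameters, and numeric values are illustrative assumptions and are not taken from any of the cited approaches.

```python
import math


def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time 1 / (mu - lambda) of an M/M/1 queue."""
    if arrival_rate >= service_rate:
        return math.inf  # unstable queue: the backlog grows without bound
    return 1.0 / (service_rate - arrival_rate)


def min_vms_for_sla(total_rate: float, service_rate: float,
                    max_response_time: float, max_vms: int = 1000) -> int:
    """Smallest number of identical VMs whose mean response time meets the SLA.

    Assumption: a load balancer splits the total arrival rate evenly, so each
    VM behaves as an independent M/M/1 queue.
    """
    for n in range(1, max_vms + 1):
        if mm1_response_time(total_rate / n, service_rate) <= max_response_time:
            return n
    raise ValueError("SLA cannot be met with at most max_vms machines")


# Hypothetical tier: 100 requests/s in total, 30 requests/s per VM, 150 ms SLA.
print(min_vms_for_sla(100.0, 30.0, 0.15))  # prints 5
```

Because the model is a closed-form expression, such a check can be repeated at every control interval at negligible cost, which is the reason runtime approaches favor it over simulation.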


4.4 Other Relevant Research 


The third group of works we consider does not directly target optimization itself, 
but exploits techniques and mechanisms that are relevant for further optimization. 
A mapping study that identifies relevant research on modeling QoS 
in the Cloud is presented in [9]. Copil et al. [22] provided general guidelines to build elastic 
systems in Cloud, IoT, or hybrid human-computer context. A research agenda 
for implementing optimization tools for data-intensive applications has been pre- 
sented in [18,21]; the main concepts to consider are volume, velocity, and loca- 
tion of data. Kistowski et al. [41] proposed to model incoming workload intensity 
using time-series decomposition to identify seasonal, trend, and noise components, 
which could yield more robust optimization techniques. Andrikopoulos et al. 
[5] proposed the GENTL language for modeling multi-Cloud applications as the 
foundation for any optimization of their deployment. They argued that GENTL 
contains the right amount of abstraction to capture the essential concepts of 
multi-Cloud applications. A similar claim and model also result from research 
by Wettinger et al. [75], where the concept of a deployment aggregate is introduced 
to automate the deployment of Cloud applications. Etxeberria et al. [28] argued 
there is a large amount of uncertainty present in performance results and pro- 
posed a technique to tame such uncertainty, while Nambiar et al. [50] highlighted 
all challenges involved in model-driven performance engineering and proposed 
a more modular approach to modeling performance. Pahl and Lee [55] demon- 
strated the application of more lightweight virtualization solutions in the context 
of edge computing. Such virtualization capabilities should also be integrated in 
architecture optimization techniques. 
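
The idea of decomposing workload intensity into trend, seasonal, and noise components, mentioned earlier in this paragraph, can be illustrated with a classical additive decomposition based on a centered moving average. This is only a sketch under simplifying assumptions (additive model, odd period, known period length); it is not the method of [41], and the function name and test series are hypothetical.

```python
def decompose(series, period):
    """Additive decomposition: series = trend + seasonal + noise (odd period only)."""
    if period < 3 or period % 2 == 0:
        raise ValueError("this sketch supports odd periods >= 3 only")
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    # Trend: centered moving average; undefined (None) near the borders.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period
    # Seasonal: mean detrended value per phase, shifted to have zero mean.
    buckets = [[] for _ in range(period)]
    for i in range(half, n - half):
        buckets[i % period].append(series[i] - trend[i])
    phase_mean = [sum(b) / len(b) for b in buckets]
    offset = sum(phase_mean) / period
    seasonal = [phase_mean[i % period] - offset for i in range(n)]
    # Noise: what is left once trend and seasonal components are removed.
    noise = [None if trend[i] is None else series[i] - trend[i] - seasonal[i]
             for i in range(n)]
    return trend, seasonal, noise


# Hypothetical workload: linear growth plus a repeating pattern of length 3.
pattern = [2, -1, -1]
workload = [10 + i + pattern[i % 3] for i in range(12)]
trend, seasonal, noise = decompose(workload, 3)
```

An optimizer could then plan capacity against the trend and seasonal components and reserve headroom proportional to the observed noise.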

A systematic mapping study on software architectural decisions, such as documenting 
decisions or functional requirements, is provided in [68]. It identifies a 
recent increase in interest in architectural decisions. Considering 
these conclusions together with the research on architecture optimization, there 
is a need for further incentives to close the gap between human and automated 
processes for architecture formation and optimization. 


5 Coordination Patterns for Decentralized Elasticity Control 


An elastic system has the ability to dynamically adjust the amount of allocated 
resources to meet variations in workload demands [2,23]. To realize an elastic 
system, we need to perform several operations aimed to observe the system evo- 
lution, determine the scaling operations to be performed, and finally reconfigure 
the system (if needed). A prominent and well-known reference model to orga- 
nize the autonomous control of a software system is MAPE [24,63]. It includes 
four main components, namely Monitor, Analyze, Plan, and Execute, which are 
responsible for the key functions of self-adaptation, and specifically of elasticity. 

The Monitor component collects data about the controlled system and the 
execution environment. Furthermore, the Monitor component specifies the inter- 
action mode (e.g., push, pull) and the interaction frequency (e.g., time-based, 
event-based) that starts the control loop. Afterwards, the Analyze component 
processes the harvested data, so as to identify whether adapting the system (e.g., 
scaling out the number of system resources) can be beneficial for its performance. 
During this phase, the costs related to the reconfiguration (e.g., due to the migration 
and/or replication of the resource and its state) should also be taken into 
account, because as a side effect the reconfiguration could negatively impact 
the system performance. For example, too frequent reconfigurations that 
require data movement and/or freezing the application can cause a QoS 
degradation (e.g., in terms of availability). 

If some adaptation action is needed, the Plan component is triggered and is 
responsible for: determining which system component needs to be reconfigured; 
identifying whether the number of resources (e.g., computing, network, storage) 
needs to be increased or decreased; and computing the number of resources to 
be added/removed/migrated and, if required, their new location. As soon as the 
reconfiguration strategy is computed, the Execute component puts it into action. 
Depending on the controlled system, enacting a reconfiguration can translate, 
e.g., into updating routing rules, replicating processing elements, or migrating 
state information and component code. 
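
The four phases described above can be sketched as a single, centralized control iteration. The snippet below is only an illustration: the dictionary layout, the utilization thresholds, the target utilization, and all names are assumptions made for this example, and load is expressed in units of the capacity of one replica.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingAction:
    component: str
    delta: int  # > 0 scales out, < 0 scales in


def monitor(system: dict) -> dict:
    """Monitor: per-replica utilization of every component."""
    return {name: c["load"] / c["replicas"] for name, c in system.items()}


def analyze(utilization: dict, low: float = 0.3, high: float = 0.8) -> list:
    """Analyze: components whose utilization leaves the [low, high] band."""
    return [name for name, u in utilization.items() if u < low or u > high]


def plan(system: dict, candidates: list, target: float = 0.6) -> list:
    """Plan: replicas to add or remove so that utilization approaches the target."""
    actions = []
    for name in candidates:
        wanted = max(1, math.ceil(system[name]["load"] / target))
        delta = wanted - system[name]["replicas"]
        if delta != 0:
            actions.append(ScalingAction(name, delta))
    return actions


def execute(system: dict, actions: list) -> None:
    """Execute: enact the planned reconfiguration."""
    for action in actions:
        system[action.component]["replicas"] += action.delta


def mape_iteration(system: dict) -> list:
    actions = plan(system, analyze(monitor(system)))
    execute(system, actions)
    return actions


# Hypothetical system with an overloaded front end and a healthy database.
system = {"frontend": {"load": 4.5, "replicas": 5},
          "database": {"load": 1.0, "replicas": 2}}
first = mape_iteration(system)
second = mape_iteration(system)
```

The sketch ignores reconfiguration costs; a more faithful Analyze phase would weigh the expected benefit of each action against the cost of migration or replication, as discussed above.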

When the controlled system is geographically distributed (e.g., Fog comput- 
ing, distributed Cloud computing) or when it includes a large number of compo- 
nents (e.g., IoT devices, network switches), a single MAPE loop, where decisions 
are centralized on a single component, may not effectively manage the elastic- 
ity. As described by Weyns et al. in [76], different patterns to design multiple 
MAPE loops have been used in practice by decentralizing the functions of self- 
adaptation. In this section, we customize the patterns proposed in [76] aiming to 
provide some guidelines for the development of systems that control the elastic- 
ity of geographically distributed resources. The distributed system components 
running the MAPE loop can be arranged in a hierarchical architecture (Sect. 5.1) 
or in a flat architecture (Sect. 5.2). In the first case, MAPE loops are organized 
in a hierarchy, where some control loops supervise the execution of subordinate 
MAPE loops. In the latter case, MAPE loops are peers of one another; as such, 
they can work autonomously or coordinate their execution by exchanging control 
messages. 


5.1 Hierarchical Patterns 


In this section, we present three patterns that organize the MAPE loops in a 
hierarchy, where a higher-level control loop manages subordinated control loops. 


Master-Worker Pattern. When a system includes a large number of components, 
having a (single) centralized component that performs elasticity decisions 
might easily become the architecture bottleneck. To overcome this issue, the system 
can be organized so as to decentralize some of the MAPE operations, exploiting 
the ability of distributed components to run control operations. Indeed, the 
system may need to perform the monitoring and execution operations locally at 
each distributed component, e.g., because of special equipment, size of exchanged 
data, or specificity of operations. On the other hand, to preserve a consistent view of 
the system and meet global guarantees while keeping the system simple, the system 
can include a centralized entity which coordinates the elasticity decisions. As 
such, it can easily prevent unneeded reconfigurations or conflicting scaling operations. 
Unlike a completely centralized approach, this design pattern 
relieves the burden on the central component, which now oversees only a subset 
of the MAPE phases, by including and integrating multiple, decentralized control 
cycles, in charge of performing some control activities locally. Specifically, 
this pattern is well suited when the distributed entities to be controlled have 
monitoring and actuating capacity and can change their behavior according to 
external decisions (e.g., machines in smart manufacturing, SDN devices, Virtual 
Network Functions). 


Pattern: A master-worker pattern structures the system in a two-level hierarchical 
architecture. At the highest level, a single master component oversees the 
analysis and planning of scaling operations. At the lowest level, multiple indepen- 
dent components run the distributed Monitor and Execute operations. Figure 2 
provides a graphical representation of this pattern. Each distributed Monitor 
component communicates with a centralized Analyze component by providing 
aggregated (or high-level) information on the nodes, which can be used to steer 
some elasticity action on the system. Should a scaling operation be performed, 
the centralized component plans an adaptation strategy, which consists of determining 
the resources to be scaled and the magnitude of the scaling operation. 
The planned decision is sent back to the distributed nodes, which will ultimately 
enact it. Observe that, by centralizing the Analyze and Plan components, 
this pattern facilitates the implementation of efficient scaling policies that aim 
at achieving global objectives and guarantees. On the other hand, sending the 
collected monitoring information to the master component and distributing the 
subsequent scaling actions may impose a significant communication overhead. 
Moreover, the centralized component that runs the Analyze and Plan phases 
may become a bottleneck in case of large-scale distributed systems. 


Fig. 2. Hierarchical MAPE: master-worker pattern. 


Example: SDN switches are in charge of forwarding data as requested by the 
SDN controller. To guarantee performance, an SDN controller can allocate a network 
path to route traffic with specific QoS requirements. For example, a path 
can be dedicated to a specific data-intensive and latency-sensitive application, 
or multiple paths can be used in parallel to increase the bandwidth in specific 
network segments. The allocation of resources can be changed at run-time, by 
monitoring and analyzing the network, so as to plan a strategy for reallocating 
resources (i.e., network paths). In this setting, an elastic system can include 
components that realize MAPE control cycles at two different levels. At the 
lowest level, SDN devices run the Monitor and Execute components of MAPE, 
whereas at the higher level, the SDN controller runs the Analyze and Plan components. 
An SDN controller retrieves network information (e.g., link utilization) 
from distributed SDN-enabled devices. By analyzing this information, the controller 
can plan to scale network resources, aiming to increase or reduce the 
bandwidth capacity of a (logical) network path between two communicating 
devices. To scale the capacity of a network path, the SDN controller can allocate 
multiple parallel paths to route data. Afterwards, the distributed SDN devices 
can enact the new forwarding rules, and reroute packets accordingly. 
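
The division of responsibilities in the master-worker pattern can be sketched as follows. Workers run only Monitor and Execute, while the master runs Analyze and Plan on the aggregated reports. The class names, the report format, the utilization threshold, and the target utilization are assumptions made for this illustration; no real SDN controller API is used.

```python
import math


class Worker:
    """Distributed node (e.g., an SDN device): Monitor and Execute only."""

    def __init__(self, name: str, load: float, paths: int = 1):
        self.name = name
        self.load = load    # traffic, in units of the capacity of one path
        self.paths = paths  # parallel paths currently allocated

    def monitor(self) -> dict:
        # Report aggregated information rather than raw samples.
        return {"node": self.name, "load": self.load, "paths": self.paths}

    def execute(self, paths: int) -> None:
        self.paths = paths


class Master:
    """Centralized component (e.g., an SDN controller): Analyze and Plan only."""

    def __init__(self, workers, high: float = 0.8, target: float = 0.5):
        self.workers = {w.name: w for w in workers}
        self.high = high
        self.target = target

    def analyze(self, reports: list) -> list:
        return [r for r in reports if r["load"] / r["paths"] > self.high]

    def plan(self, overloaded: list) -> dict:
        # Number of parallel paths that brings utilization down to the target.
        return {r["node"]: math.ceil(r["load"] / self.target) for r in overloaded}

    def control_step(self) -> dict:
        reports = [w.monitor() for w in self.workers.values()]
        decisions = self.plan(self.analyze(reports))
        for node, paths in decisions.items():
            self.workers[node].execute(paths)  # decision sent back to the worker
        return decisions


s1, s2 = Worker("s1", load=0.9), Worker("s2", load=0.4)
master = Master([s1, s2])
decisions = master.control_step()
```

Every report and every decision crosses the network in this arrangement, which is where the communication overhead and the potential bottleneck discussed above originate.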


Regional Planning Pattern. A large-scale system can be organized in multiple, 
distributed, and loosely coupled parts (or regions), which cooperate to realize 
a complex integrated system. Computing and performing scaling decisions on 
this system might be challenging, because we would like to control elasticity of 
subsystems within a single region as well as the elasticity of the overall system 
distributed across multiple regions. Typical scenarios involve federated infras- 
tructures, where networks, Cloud infrastructures, or Cloud platforms should be 
controlled to realize an elastic system. In this context, scaling operations within 
regions may aim to optimize resource allocation, while adaptations between 
regions may optimize load distribution or improve communications under particular 
conditions. For example, in the Fog environment, an elastic system can 
improve and reserve fast communication links from resources at the edge of the 
network to the Cloud, in response to emergency events (e.g., earthquakes, floods, 
tsunamis). 


Pattern: In the regional planning pattern, represented in Fig. 3, the system is 
organized in regions. A region has a two-level hierarchical structure, where the 
top level includes a Plan component (a regional planner), and the lower level 
includes components performing the remaining MAPE phases (Monitor, Analyze, 
and Execute). The regional planner 
collects the necessary information from the underlying subsystems, so as to determine 
when and how to scale the system components. Moreover, regional planners 
interact with one another to coordinate adaptation actions that span multiple 
regions. Within each region, the Monitor component observes the region subsystem, 
and the local Analyze component elaborates the collected data and reports the 
outcomes to the regional planner. Leveraging the collected information, the 
latter can plan a scaling operation that involves a single region or that spans 
multiple regions. In the latter case, the regional planner might interact 
with other regional planners to coordinate the scaling operation. Once they agree 
on a scaling strategy, they can enact the adaptation by activating the Execute 
components of the respective regions. This pattern is well suited when regions 
are under different ownership, because the MAPE loop of a region exposes only 
limited information (i.e., the outcome of the analysis phase), without providing 
raw data (which result from the monitoring components). Similarly, once the 
scaling strategy is devised, the region is responsible for enacting the required 
adaptation actions; as such, the implementation details can be hidden from the 
regional planner. 


Fig. 3. Hierarchical MAPE: regional pattern. 


Example: In a Fog computing environment, near-edge micro data centers support 
the execution of distributed applications by providing computing resources 
near the users (or the data sources). In a wide area, these micro data centers 
can be managed by different authorities (e.g., university campus, IT company) 
and usually expose Cloud-like APIs, which allow micro-computing resources 
to be allocated and released as needed [53]. The combination of Fog and Cloud allows 
the provisioning of resources with different computational and network capabilities, 
thus opening a wide spectrum of approaches to realize system elasticity. 
For example, a system can be scaled within a region, using multiple resources 
belonging to the same infrastructure (i.e., Fog, Cloud), or can be scaled across 
multiple regions, so as to take advantage of their different features (e.g., the computational 
power of Cloud resources, the reduced network delays of near-edge 
devices). In general, separate Fog/Cloud data centers can be regarded as different 
regions, possibly under different ownership domains. Within each region, the 
system runs a MAPE control cycle which comprises only the Monitor, Analyze, 
and Execute components. Relying on these components, the system can monitor 
resource utilization as well as incoming workload variations, and trigger scaling 
operations. In such a case, the regional planner, which can run inside or outside 
the region (e.g., in the Cloud), is invoked. When the planner determines the scaling 
strategy, it can decide to offload some computation to other regions (i.e., 
by possibly acquiring resources in the Cloud) or to change resource allocation 
within the region under its control. 
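
A possible reading of the regional planning pattern is sketched below. Each region keeps its raw monitoring data private and exposes only the outcome of its Analyze phase; regional planners first try to scale locally and then ask peer planners to host the remainder. The class names, the capacity model (abstract resource units), and the negotiation rule are assumptions made for this illustration.

```python
class Region:
    """Lower level of a region: Monitor, Analyze, and Execute only."""

    def __init__(self, name: str, capacity: int, allocated: int, demand: int):
        self.name = name
        self.capacity = capacity    # resource units the region can host
        self.allocated = allocated  # units currently allocated
        self.demand = demand        # units required by the current workload

    def monitor(self) -> dict:
        # Raw data: never leaves the region.
        return {"allocated": self.allocated, "demand": self.demand}

    def analyze(self) -> dict:
        raw = self.monitor()
        return {"deficit": max(0, raw["demand"] - raw["allocated"]),
                "free": self.capacity - raw["allocated"]}

    def execute(self, extra: int) -> None:
        self.allocated += extra


class RegionalPlanner:
    """Top level of a region: the Plan component."""

    def __init__(self, region: Region):
        self.region = region

    def offer(self, amount: int) -> int:
        """Answer to a peer planner: units of `amount` this region can host."""
        summary = self.region.analyze()
        return min(amount, max(0, summary["free"] - summary["deficit"]))

    def plan(self, peers: list) -> list:
        summary = self.region.analyze()
        local = min(summary["deficit"], summary["free"])
        actions = [(self, local)] if local else []
        remaining = summary["deficit"] - local
        for peer in peers:  # offload what cannot be hosted locally
            if remaining == 0:
                break
            granted = peer.offer(remaining)
            if granted:
                actions.append((peer, granted))
                remaining -= granted
        return actions

    @staticmethod
    def enact(actions: list) -> None:
        for planner, amount in actions:
            planner.region.execute(amount)


fog = RegionalPlanner(Region("fog", capacity=10, allocated=8, demand=13))
cloud = RegionalPlanner(Region("cloud", capacity=100, allocated=20, demand=15))
actions = fog.plan([cloud])
RegionalPlanner.enact(actions)
```

Note that the planners exchange only requested and granted amounts, so each region can keep its monitoring data and its enactment mechanism hidden from the others.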


Hierarchical Control Pattern. When the complexity of a distributed system 
increases, controlling its elasticity might also involve complex machinery. In this 
case, a classic approach to tame the system complexity relies on the divide et 
impera principle, according to which the system is split into different subsystems, 
which can be more easily controlled. To steer the adaptation of the overall system 
behavior, another control loop coordinates the evolution of each subsystem. The 
resulting system includes multiple control loops, which work at different time 
scales and manage different resources or different kinds of resources. In this 
context, control loops need to interact and coordinate their actions to avoid 
conflicts and provide certain guarantees about elasticity. 


Pattern: The hierarchical control pattern provides a layered separation of con- 
cerns to manage the elasticity of complex systems. According to this pattern, 
the adaptation logic is embedded in a hierarchy of MAPE loops. Layers of the 
hierarchy oversee different concerns at a different level of abstraction and, pos- 
sibly, by working at a different time scale. Usually, each layer includes a MAPE 
loop which comprises all the four control steps. However, different sub-patterns 
can be obtained by customizing the hierarchical MAPE and the way the hier- 
archical layers interact with one another. As regards the latter, a wide range 
of opportunities can be elaborated: on the one side, a higher level component 
works without a direct interaction with lower levels; on the other side, a higher 
level component (e.g., Monitor) recursively interrogates the lower level compo- 
nents (e.g., Monitors) to perform its tasks. Figure 4 illustrates the hierarchical 
control pattern, where the Monitor and Execute components closely cooperate 
with the lower-level components, whereas the Analyze and Plan components 
work autonomously at each level. This approach is well suited for a system 
where multiple but dependent levels of control can be easily identified, such as 
distributed applications (or services), which are built as a combination of small, 
elastic building blocks. 


Fig. 4. Hierarchical MAPE: hierarchical control pattern. 


Example: Data Stream Processing (DSP) applications are a prominent app- 
roach for processing Big Data; indeed, by processing data on-the-fly (i.e., without 
storing them), they can produce results in a near real-time fashion. A DSP application 
is represented as a directed acyclic graph, where data sources, operators, 
and final consumers are interconnected by logical links, along which data streams flow. 
Each operator can be regarded as a black-box processing element that receives 
incoming streams and generates new outgoing streams. To seamlessly process 
huge amounts of data, DSP applications usually exploit data parallelism, which 
consists of increasing or decreasing the number of instances of the operators [37]. 
Multiple instances of the same operator can be executed over multiple computing 
nodes, thus increasing the amount of incoming data processed in parallel. 

To control the elasticity of DSP applications in a scalable and distributed 
manner, a DSP system can include multiple MAPE control loops, organized 
according to the hierarchical control pattern [17]. We consider a two-layered 
approach with separation of concerns and time scale between layers, where the 
higher-level MAPE loop controls subordinate MAPE components. At the lower 
level and at a faster time scale, an operator manager is the distributed entity 
in charge of controlling the replication degree of a single DSP application operator 
through a local MAPE loop. It monitors the logical and physical system 
components used by the operator and then, by analyzing the monitored data, 
determines whether a local operator scaling action is needed. If so, 
the lower-level Analyze component issues an operator adaptation request to the 
higher layer. At the higher level and at a slower time scale, an application manager 
is the centralized entity that coordinates the elasticity of the overall DSP 
application through a global MAPE loop. First, it monitors the global application 
behavior. Then, it analyzes the monitored data and the reconfiguration 
requests received from the multiple operator managers, so as to decide which reconfigurations 
should be granted. Afterwards, the granted decisions are communicated 
to each operator manager, which can, finally, execute the operator adaptation 
actions. The higher-level control loop has a more strategic view of the application 
evolution; therefore, it coordinates the scaling operations. Since performing 
a scaling operation introduces a temporary application downtime, the global 
MAPE loop limits the number of reconfigurations when they are not needed 
(e.g., when the application performance requirements are satisfied). Conversely, 
when the application performance is approaching a critical value (e.g., maximum 
response time), the global MAPE loop is more willing to grant reconfigurations, 
so as to quickly settle the performance issues. 

Such a hierarchical design of the elasticity control makes it possible to overcome the system 
bottleneck represented by the centralized components of the MAPE loop in 
the master-worker pattern (e.g., see [16] for its application to elastic data stream 
processing), especially when the system is composed of a multitude of processing 
entities scattered in a large-scale geo-distributed environment. 
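
The two-layered control of a DSP application described above can be sketched as follows. Operator managers raise scaling requests from local observations, and the application manager grants them only when the measured response time approaches its maximum. The class names, the utilization threshold, and the simple grant-all or grant-none policy are assumptions made for this illustration and do not reproduce the policies of [17].

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScalingRequest:
    operator: str
    delta: int  # replicas to add (positive) or remove (negative)


class OperatorManager:
    """Lower layer: local MAPE loop for one operator (faster time scale)."""

    def __init__(self, operator: str, replicas: int, high: float = 0.8):
        self.operator = operator
        self.replicas = replicas
        self.high = high

    def analyze(self, utilization: float) -> Optional[ScalingRequest]:
        # Ask the higher layer for one more replica when the operator is overloaded.
        if utilization > self.high:
            return ScalingRequest(self.operator, 1)
        return None

    def execute(self, request: ScalingRequest) -> None:
        self.replicas += request.delta


class ApplicationManager:
    """Higher layer: global MAPE loop that grants requests (slower time scale)."""

    def __init__(self, managers: List[OperatorManager],
                 max_response_time: float, critical_fraction: float = 0.9):
        self.managers = {m.operator: m for m in managers}
        self.threshold = critical_fraction * max_response_time

    def control_step(self, response_time: float,
                     requests: List[ScalingRequest]) -> List[ScalingRequest]:
        # Reconfigurations cause downtime, so they are granted only when the
        # response time approaches its maximum allowed value.
        granted = requests if response_time >= self.threshold else []
        for request in granted:
            self.managers[request.operator].execute(request)
        return granted


filter_mgr = OperatorManager("filter", replicas=2)
join_mgr = OperatorManager("join", replicas=1)
app = ApplicationManager([filter_mgr, join_mgr], max_response_time=1.0)
requests = [r for r in (filter_mgr.analyze(0.9), join_mgr.analyze(0.5))
            if r is not None]
denied = app.control_step(0.5, requests)    # requirements met: nothing granted
granted = app.control_step(0.95, requests)  # close to the limit: request granted
```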


5.2 Flat Patterns 


We now discuss two patterns that organize the MAPE loops in a flat structure, 
where multiple control loops cooperate as peers to manage the elasticity of a 
distributed system. Due to the lack of central coordination, designing a stable 
scaling strategy is challenging, although the resulting control architecture makes 
the system highly scalable. 


Coordinated Control Pattern. Sometimes controlling the elasticity of a system 
in a centralized component is infeasible. Such a lack of feasibility may arise 
for several reasons, among which are the scale of the system and the presence of 
multiple ownership domains. As regards the former issue, a large-scale system 
makes it difficult (or impractical) to quickly move all the monitored data to a single 
node, which is prone to become the system bottleneck. Nevertheless, in such 
a context, we still need to develop a system which can control the system elasticity 
so as to meet certain QoS requirements. In this case, multiple MAPE loops can 
be employed to control the distributed system. Each control loop supervises 
one part of the system; the resulting control loops must also coordinate with 
one another as peers so as to reach, if needed, some joint adaptation decision about 
elasticity. 


Pattern: The coordinated control pattern employs multiple MAPE loops, which 
are disseminated within the system. Each loop is in charge of controlling one part 
of the system. To compute scaling decisions, the phases of each loop can coordinate 
their operation with the corresponding phases of other peer loops. The pattern 
does not prescribe the number of peer loops that should coordinate 
with one another: in some implementations peers are completely autonomous; in 
others, the cooperation is restricted to neighbor peers; and in yet others all the 
peers communicate with one another. Figure 5 provides a graphical representation of 
this pattern. For example, the distributed Analyze components exchange information 
so as to determine whether some part of the system needs to perform a scaling 
operation. Then, after planning the reconfiguration, the distributed Execute 
components exchange messages to synchronize the adaptation actions, which 
should be performed without compromising the application integrity. 


Fig. 5. Flat MAPEs: coordinated control pattern. 


Example: This pattern can be useful to control elasticity when the system spans 
multiple ownership domains with no trustworthy authority to control adaptation. 
We consider the example of a monitoring application that manages smart power 
plugs (i.e., a special kind of IoT device) disseminated across multiple cities. We further 
assume that these IoT devices reside under different authority domains, e.g., 
one for each city (or neighborhood). To support the proper execution of the monitoring 
application, the current network and computing infrastructure should 
adapt itself to support the varying load imposed by the application. Specifically, 
the IoT devices continuously emit a varying load of data that should be pushed 
towards the core of the Internet, so as to reach Cloud data centers, where the 
applications extract meaningful information (e.g., predict energy consumption, 
identify anomalies). The communication between IoT devices and the Cloud 
is often mediated by IoT gateways, which make it possible to overcome the heterogeneity 
of the two parts, in terms of connectivity, energy, and availability. 
To properly control this distributed infrastructure, a MAPE control loop can 
be installed within each authority domain, so as to elastically scale the number of 
resources needed to realize the communication between the involved parties (i.e., 
smart power plugs, Cloud). In this case, the Monitor component of the MAPE 
loop collects data on the working conditions of IoT devices. These data are analyzed 
so as to determine whether new IoT gateways should be allocated to meet the 
application requirements. Since allocating a gateway imposes a monetary cost, 
the multiple MAPE loops can coordinate their actions so as to limit the execution 
costs and not exceed the allocated budget. Ultimately, when a scaling action 
is granted, the Execute component starts a new IoT gateway in the authority 
domain specified by the Plan component. 
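
One way to realize the budget-aware coordination of this example without a central authority is sketched below. Every domain loop announces how many additional gateways it needs, and each loop then applies the same deterministic rule to the same set of proposals, so that all peers reach a consistent joint decision. The class name, the capacity of a gateway, and the allocation rule (largest need first) are assumptions made for this illustration.

```python
import math


class DomainLoop:
    """MAPE loop of one authority domain; all loops are peers."""

    def __init__(self, domain: str, devices: int, gateways: int,
                 devices_per_gateway: int = 100):
        self.domain = domain
        self.devices = devices
        self.gateways = gateways
        self.devices_per_gateway = devices_per_gateway

    def analyze(self) -> int:
        # Monitor + Analyze: additional gateways needed by this domain.
        needed = math.ceil(self.devices / self.devices_per_gateway)
        return max(0, needed - self.gateways)

    def plan(self, proposals: dict, budget: float, gateway_cost: float) -> int:
        # Same rule, same input on every peer: serve the largest need first
        # (ties broken by domain name) until the shared budget is exhausted.
        affordable = int(budget // gateway_cost)
        grants = {}
        for domain, wanted in sorted(proposals.items(),
                                     key=lambda item: (-item[1], item[0])):
            grants[domain] = min(wanted, affordable)
            affordable -= grants[domain]
        return grants[self.domain]

    def execute(self, new_gateways: int) -> None:
        self.gateways += new_gateways


def coordination_round(loops: list, budget: float, gateway_cost: float) -> dict:
    # Proposals are exchanged among the peers before anyone acts.
    proposals = {loop.domain: loop.analyze() for loop in loops}
    for loop in loops:
        loop.execute(loop.plan(proposals, budget, gateway_cost))
    return proposals


a = DomainLoop("city-a", devices=350, gateways=2)
b = DomainLoop("city-b", devices=120, gateways=1)
c = DomainLoop("city-c", devices=80, gateways=1)
proposals = coordination_round([a, b, c], budget=20.0, gateway_cost=10.0)
```

In this run the budget pays for two gateways only, so the request of the second domain is not served in this round.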


Information Sharing Pattern. Some large-scale systems comprise distributed 
peers which work cooperatively to accomplish tasks. In particular, each peer is 
capable of performing some tasks (e.g., it offers services), but could require an interaction 
with other peers to carry out these tasks (e.g., to resolve service dependencies). 
Examples of this scenario come from the pervasive computing domain, such as 
ambient intelligence or smart transportation systems, where peers work together 
to reach some common goals. Each distributed peer can locally take scaling 
decisions. Nevertheless, since a local adaptation may influence the other system 
components, taking scaling decisions requires some form of coordination that can 
be reached by sharing information among system components. 


Pattern: The information sharing pattern is a special case of the coordinated control 
pattern, where the interaction between the decentralized MAPE control 
loops involves only the Monitor phase (see Fig. 6). The pattern does not strictly 
regulate the way peers interact with one another: for example, when the sys- 
tem comprises a large number of peers, only a subset of them (i.e., neigh- 
bors) exchange monitoring information. The following MAPE phases operate 
on (approximately) the same view of the system, thus allowing the Analyze, 
Plan, and Execute phases to be performed locally. On the one hand, this pat- 
tern helps to realize scalable and elastic systems. On the other hand, since there 
is no explicit coordination among peers (i.e., they operate autonomously), con- 
flicting or sub-optimal scaling actions can be enacted; in the worst case, the 
system enters in an unstable state, where adaptation actions are continuously 
applied. 
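
As a minimal sketch of this pattern, the following Python fragment lets each peer exchange only monitoring data with its neighbors and then decide locally. The Peer class, the utilization thresholds, and the averaging rule are assumptions made for illustration; the pattern itself does not prescribe them.

```python
from statistics import mean


class Peer:
    """Hypothetical peer running a local MAPE loop."""

    def __init__(self, name, utilization, replicas=1):
        self.name = name
        self.utilization = utilization
        self.replicas = replicas
        self.neighbors = []
        self.view = {}

    def monitor(self):
        # Only this phase involves other peers: collect the neighbors'
        # utilization to build an (approximate) shared view of the system.
        self.view = {self.name: self.utilization}
        for neighbor in self.neighbors:
            self.view[neighbor.name] = neighbor.utilization

    def analyze_plan_execute(self, high=0.8, low=0.3):
        # Analyze, Plan, and Execute run locally on the shared view.
        # The thresholds are assumed values, not taken from the chapter.
        average = mean(self.view.values())
        if average > high:
            self.replicas += 1
            return "scale_out"
        if average < low and self.replicas > 1:
            self.replicas -= 1
            return "scale_in"
        return "none"
```

If two neighboring peers observe the same overload, both scale out in this sketch, which is exactly the kind of uncoordinated and possibly redundant action the pattern can produce.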


Fig. 6. Flat MAPEs: information sharing pattern.


Example: Relying on this pattern, the system can elastically acquire and release
resources in a fully decentralized manner, leveraging only the monitoring
information which is shared among the distributed controllers. We consider the
problem of executing long-running workflows of services in a fully decentralized
system. The system comprises peers that run and expose some services. A peer can
receive requests for service workflow execution; a service workflow is a graph of
abstract services (i.e., definitions of required services) that needs to be resolved
into a set of concrete services (i.e., implementations of abstract services). To
realize the service choreography, each peer needs to discover the services offered
by other peers, together with their utilization level, so as to determine the best
mapping that satisfies the workflow requirements (e.g., minimum response time,
maximum throughput). Similarly to the approach presented in [15], the system
can employ the information sharing pattern to share, among peers, knowledge
about the services offered by peers and their utilization state. Relying on this
information, at run-time, the service choreography can be adapted so as to
automatically scale the number of concrete services to be used to run the workflow.
Aside from the shared monitoring information, the scaling decisions are made
locally by each peer.


6 Challenges and Future Perspectives 


Although many research efforts have investigated how to efficiently achieve
elasticity, most of them rely on a centralized Cloud environment. With the
diffusion of edge/fog computing, we have witnessed a paradigm shift towards the
execution of complex systems over distributed Cloud and edge/fog computing
resources, which brings system components closer to the data, rather than the
other way around. This new environment, which offers geo-distributed resources,
promises to open new possibilities for realizing elasticity, thanks to the
cooperation of computing resources at different levels of the overall
infrastructure (i.e., at the network edge and in the network core). Nevertheless,
the full potential, together with the challenges, of these distributed edge Cloud
environments is still, to the best of our knowledge, largely unexplored.

We identify several main challenges and research directions that could benefit
from further investigation, so as to improve the current state of
the art.


Strategies for Decentralization. Thanks to their widespread adoption, IoT
devices act as geo-distributed data sources that continuously emit data. The most
widespread approaches for processing these data rely on a centralized Cloud
solution, where all the data are collected in a single data center, processed, and
ultimately sent back to (possibly distributed) information consumers. Although the
Cloud offers flexibility and elasticity, such a centralized solution is not well suited
to efficiently handle the increased demand for real-time, low-latency services with
distributed IoT sources and consumers. As envisioned by the convergence of
edge/fog and Cloud computing resources, this diffused environment can support
the execution of distributed applications by increasing scalability and reducing
communication latencies. Nevertheless, in this context, computing resources are
often heterogeneous and, most importantly, can be interconnected by network
links with non-negligible communication latencies. In this geo-distributed
infrastructure, a relevant problem consists in determining, within a set of available
distributed computing resources, the ones that should execute each component
of the distributed application, aiming to optimize the application QoS attributes.
However, most of the existing deployment solutions have been designed to
work in a centralized environment, where network latencies are (almost) zero. As
such, these policies cannot be easily adapted to run in the emerging environment.
To the best of our knowledge, efficient approaches to deploy applications
in distributed hybrid edge Cloud environments are still largely unexplored.


Infrastructure-awareness. The convergence of distributed edge/fog environ- 
ments with Cloud environments results in a great variety of resources, whose 
peculiar features can be exploited to perform specific tasks. For example, 
resources located at the network edges, which are usually characterized by a 
medium-low computing capacity and possibly limited battery, can be used by 
a monitoring system to filter and aggregate raw data as soon as they are gen- 
erated. Conversely, clusters of specialized machines (e.g., [1]) can be exploited 
to efficiently perform machine learning tasks. Most of the existing distributed 
systems, which manage data coming from decentralized sources, are
infrastructure-oblivious, i.e., their deployment neglects the peculiar
characteristics of the available computing and networking infrastructure. In the
IoT context, where huge amounts of data have to be moved between geo-distributed
resources, inefficient exploitation of resources can strongly penalize the resulting
performance of distributed applications. To deliver efficient and flexible solutions,
next-generation systems should consider, as a key factor, the physical connections
and the relationships among infrastructural elements.


Elasticity in the Emerging Environment. The combination of edge/fog and 
Cloud computing results in a hierarchical architecture, where multiple layers are 
spread as a continuum from the edge to the core of the network. The presence 
of multiple layers, each with different computational capabilities, opens a wide 
spectrum of approaches for realizing elasticity. For example, we could scale the
application components horizontally, so as to use multiple resources that belong
to the same infrastructural layer (i.e., edge or Cloud); alternatively, we could
employ resources belonging to multiple layers, so as to take advantage of their
different features (e.g., the computational power of Cloud servers, the closeness
of edge devices). Moreover, the presence of multiple degrees of freedom raises
concerns regarding the coordination among the different scaling operations. When
is it more convenient to use resources from the same layer? When should we employ
resources from multiple layers? Can communication delays outweigh the benefit
of operating with resources belonging to multiple layers?


The Cost of Elasticity. Reconfiguring an application at runtime involves the
execution of management operations that enact the deployment changes while
preserving the application integrity. The latter is a critical task, especially when
the application includes components that cannot be simply restarted on a new
location, but require, e.g., exporting some internal state and importing it on the
new location. Therefore, together with long-term benefits, adapting the application
deployment also introduces some adaptation costs, usually expressed in terms
of downtime, that penalize the application performance in the short term.
Because of these costs, reconfigurations cannot be applied too frequently. A key
challenge is to wisely select the most profitable adaptation actions to enact, so
as to identify a suitable trade-off, in terms of performance, between application
elasticity and adaptation costs.
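
The trade-off can be made concrete with a simple, hedged cost model. In the Python sketch below, every quantity is an assumption introduced for illustration: a deployment incurs a penalty rate (cost per second, e.g., due to QoS degradation), a reconfiguration causes a given downtime during which a higher downtime penalty rate applies, and the decision is evaluated over a fixed time horizon.

```python
def cost_of_moving(new_rate, downtime, downtime_rate, horizon):
    """Total penalty over the horizon if the reconfiguration is enacted."""
    downtime = min(downtime, horizon)
    return downtime_rate * downtime + new_rate * (horizon - downtime)


def choose_action(current_rate, candidates, downtime_rate, horizon):
    """Return the cheapest option over the horizon, or "stay".

    candidates maps an action name to (new_rate, downtime_seconds).
    All names and the linear cost model are illustrative assumptions.
    """
    best_name = "stay"
    best_cost = current_rate * horizon
    for name, (new_rate, downtime) in candidates.items():
        cost = cost_of_moving(new_rate, downtime, downtime_rate, horizon)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name
```

With a short horizon the downtime dominates and the sketch keeps the current deployment; with a longer horizon the long-term benefit amortizes the adaptation cost, which is why reconfigurations should not be applied too frequently.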


Multi-dimensional Elasticity. Besides resource elasticity, we can identify
different elasticity dimensions, as envisioned by Truong et al. [72]. Examples of
other dimensions are cost, data, and fault tolerance. Indeed, during the execution
of a complex distributed system, the cost of using computing resources or
the benefits coming from the output of the system may change at runtime.
Similarly, the quality of data can be elastically managed, in such a way that when it
is too expensive to produce results with high quality, we can tune the system to
temporarily degrade result quality, in a controlled manner, so as to save resources.
For example, this could be helpful during congestion periods, when we might
accept discarding a wisely selected subset of the incoming data. As regards fault
tolerance, for some kinds of applications, we might be willing to sacrifice fault
tolerance during congestion periods so as to perform computation at reduced
cost. As expected, finding an optimal trade-off between the different elasticity
dimensions strongly depends on the application at hand and, in general, is not
an easy task.


Resource Management. The resulting infrastructure is complex: multiple
heterogeneous resources are available at different geo-distributed locations;
distributed applications expose different QoS attributes and requirements; and
different elasticity dimensions can be controlled. Moreover, the elastic adaptation
of applications might require infrastructure-awareness, which makes it possible to
operate conveniently at different levels of the computing infrastructure.

To master this complexity, a new architectural layer should be designed so as to
support the execution of (multiple) applications over a continuum of edge/fog
and Cloud resources. This intermediate layer can be implemented as a distributed
resource manager, which should be able to efficiently control the allocation of
computing and network resources, by conveniently exposing different views of
the infrastructure. On the one hand, the resource manager allows multiple
applications to be executed fairly by better exploiting the available resources. On
the other hand, by taking care of managing the computing infrastructure, it enables
distributed applications to more easily control their elasticity.

A side effect of the introduction of a resource manager is the need to design
standardized interfaces between the applications and the decentralized resources.
To the best of our knowledge, today there are no standard mechanisms that allow
resources to announce their availability to host software components, or that allow
distributed applications to smoothly control edge/fog and Cloud resources.


Accountability, Monitoring, and Security. Together with the specific
challenges previously identified, we have several other more general challenges.
They regard the accountability of resource consumption, the monitoring of
elastic applications/systems, and security aspects that arise from multi-tenancy and
data distribution across several locations.

We need to investigate methodologies for accountability, because in the
envisioned edge/fog computing environment, users can flexibly share their spare
resources to host applications. The hybrid resource continuum from edge/fog
to Cloud calls for studying dynamic pricing mechanisms, similar to the spot
instance pricing of the Amazon EC2 service.

The ability to monitor the elasticity of a system/application deployed in
a large-scale, dispersed, and multi-provider hybrid environment requires
investigation. How can we quantify and measure the elasticity of a complex
distributed system? As regards elasticity, we can quantify its performance by
considering the number of missing or superfluous adaptations over time, the
durations in sub-optimal states, and the amount of over-/under-provisioned
resources [35]. However, measuring such quantities in a dispersed, large-scale
environment with multiple providers turns out to be challenging.
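
To make these quantities concrete, the following Python sketch computes the accumulated over- and under-provisioned resources and the time spent in each sub-optimal state from two sampled series. It is only an illustration inspired by the metrics mentioned above, not the exact definitions of [35]; the function name, the fixed sampling interval dt, and the assumption that demand and supply are sampled at the same instants are ours.

```python
def elasticity_metrics(demand, supply, dt=1.0):
    """Summarize provisioning accuracy from equally spaced samples.

    demand[i] is the amount of resources required at sample i and
    supply[i] the amount actually allocated (illustrative assumption).
    """
    if len(demand) != len(supply):
        raise ValueError("demand and supply must have the same length")
    pairs = list(zip(demand, supply))
    return {
        # Resource-time that was missing or superfluous.
        "under_provisioned": sum(max(d - s, 0) for d, s in pairs) * dt,
        "over_provisioned": sum(max(s - d, 0) for d, s in pairs) * dt,
        # Time spent in each sub-optimal state.
        "time_under": sum(dt for d, s in pairs if s < d),
        "time_over": sum(dt for d, s in pairs if s > d),
    }
```

In a multi-provider environment, the difficult part is obtaining trustworthy and time-aligned demand and supply series in the first place; the computation itself is straightforward.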

Similarly to Cloud computing, we need to identify (or develop) efficient
business models that support and encourage the diffusion of trusted computing
resources, as well as the elasticity requirements for such business models. One of
the most important challenges arises from the lack of central controlling authorities
in the edge/fog computing environment, which makes it difficult to assert whether
a device is hosting an application component. Security aspects are of key
importance, because nowadays the value of data is very high and an infrastructure
that does not guarantee stringent security properties will hardly be adopted.
Similarly to the accountability issue, the decentralization of the emerging
environment requires dealing with the lack of a central security authority.
Sophisticated yet lightweight security mechanisms and policies should be introduced,
so as to create a disseminated trustworthy environment.


7 Conclusions 


In this chapter, we presented an analysis of QoS-based elasticity for service 
chains in distributed edge Cloud environments. Firstly, we introduced the elas- 
ticity concept that arises in emerging systems of systems, which are complex, 
distributed, and based on various virtualization technologies. Then, we focused 
on IoT and Cloud systems, in whose context we elaborated the need and meaning 
of elasticity. 

A key ingredient of elasticity is the optimization technique aiming to optimize
some QoS attributes. Firstly, we identified the key attributes that are frequently
optimized with elasticity. Then, we introduced a software engineering viewpoint
to model elasticity as one of the system attributes. In that respect, elasticity
mechanisms can be implemented in the system design phase to model software
systems that best exploit elasticity during runtime. Furthermore, elasticity
involves a runtime choice of the best solution, and such a selection
also has to be properly designed. Therefore, we reviewed the research works
on modeling elasticity in the context of design and runtime choices aiming to
provide the best elasticity model and optimal solution.

In distributed environments, elasticity mechanisms may arise not only at
different layers of system abstraction, but also within each segment of the
distributed system that, as a whole, has to deliver service to the end users.
Therefore, key elements for running QoS-aware service compositions are the
coordination mechanisms; the latter have to be efficiently implemented in order to
deliver a high-quality user experience. In this chapter, we also provided a review of
several design patterns for decentralized coordination, aiming to realize elasticity
in complex systems.

Finally, we discussed the challenges related to designing elasticity
mechanisms in geo-distributed environments. Software engineering decisions and
coordination mechanisms among segments of distributed systems need further
investigation based on empirical evidence from real technical environments.


Abstract. Traditional networks are being transformed to enable full
integration of heterogeneous hardware and software functions that are
configured at runtime, with minimal time to market, and are provided to their
end users on an “as a service” principle. Therefore, a countless number of
possibilities for further innovation and exploitation opens up. Network
Function Virtualization (NFV) and Software-Defined Networking (SDN)
are two key enablers for such a new flexible, scalable, and service-oriented
network architecture. This chapter provides an overview of QoS-aware
strategies that can be used over the levels of the network abstraction
aiming to fully exploit the new network opportunities. Specifically, we
present three use cases of integrating SDN and NFV with QoS-aware
service composition, ranging from the energy-efficient placement of
virtual network functions inside modern data centers, to the deployment of
data stream processing applications using SDN to control the network
paths, to exploiting SDN for context-aware service compositions.


1 Introduction 


Software-Defined Networking (SDN) is a new paradigm that provides
programmability in configuring network resources. It introduces an abstraction layer
on the network control layer that allows runtime and ad-hoc network
reconfiguration. Therefore, it enables the runtime adaptation not only of physical
network resources but also of the software services that compose complex services
delivered to end users. Such a new network feature thus provides a valuable
mechanism to be exploited in the modeling of QoS-aware service compositions
integrating services from various networks. This paradigm has been successfully
incorporated into the virtualization of the telecommunication network and an
architecture concept called Network Function Virtualization (NFV), where virtual
network functions are interconnected into service compositions to create
communication services.

Traditional networks that have been designed for yesterday's peak
requirements cannot cope efficiently with today's massive communication traffic
injected by a large number of users (e.g., billions of devices in the Internet of
Things). The main obstacle that prevents traditional networks from fully exploiting
their resources and accelerating innovation is the lack of integration of
the variety of hardware and software appliances. Moreover, the lack of
standardized interfaces makes network management costly and slow to adapt to
modern trends and user demands [14, 20, 27].

Within the 5G network, SDN and NFV are the two key technologies
introduced as enablers [33]. In future networks, the optimal cost is achieved through
dynamic and self-adaptive deployment on a network infrastructure which
continuously monitors its performance and autonomously manages its resources.
The primary goal of such a dynamic and autonomous deployment is to
accomplish and maintain the quality of service (QoS) requirements of complex services.
By adopting SDN and NFV for the composition of complex services,
Software-Defined Service Composition (SDSC) [21] separates the execution of service
compositions from the data plane of the overall system.

SDSC facilitates the integration and interoperability of more diverse
implementations and adaptations of the services. A reliable execution of a service
composition can be guaranteed through the network management capabilities offered
by SDN, by finding the best alternative among the multiple potential service
implementations and deployments for the
service composition execution. SDSC thus offers increased control over the
underlying network, while supporting the execution from various traditional web
service engines and distributed frameworks.

Various modeling approaches for QoS-aware service composition
have been proposed so far. With the introduction of a programmable
approach to implement and use network resources, we should investigate per- 
formance modeling approaches that jointly consider all network layers and their 
composite behavior and outputs. Therefore, the contribution of this chapter is 
to analyze the integration of SDN and NFV in modeling the performance of 
service compositions and investigate possible side effects that can arise from 
their composite interactions. To this end, we present three different use cases of 
integrating SDN and NFV with QoS-aware service composition, ranging from 
the energy-efficient placement of virtual network functions inside modern data
centers, to the deployment of data stream processing (DSP) applications using 
SDN to control the network paths, to exploiting SDN for context-aware service 
compositions. 

In the upcoming sections of this chapter, we continue to discuss the benefits
and use cases of integrating SDN and NFV with QoS-aware service composition.
Section 2 provides an overview of the basic concepts: SDN, NFV, and service
compositions. Section 3 discusses the energy-efficient green strategies enabled by
the integration of SDN and NFV with service compositions. Section 4 focuses
on a specific example of composite service, represented by DSP applications,
and elaborates on the integration of a DSP framework with an SDN controller,
showing a full vertical integration of the application and network layers. Section 5
discusses how SDN can offer context-aware service compositions. Finally, we
discuss the benefits and open research issues in QoS-aware service compositions
in Sect. 6 and conclude the chapter by identifying future research directions in
Sect. 7.


2 Overview of Basic Concepts 


A traditional network architecture divides Telco/Network operators from Inter- 
net Service Providers (ISPs) and Content Providers. Services are provided over 
highly specialized technologies which limit their full exploitation by end users. 
A new network architecture that is proposed for future networks introduces new 
abstraction layers with standardized interfaces that would enable Telco/Network 
Providers, ISPs, and Content Providers to provide their services over the web, 
independently from the underlying network. The vision of future networks is to 
provide their users with complex services that result from the autonomous com- 
position of simple, possibly legacy, elementary services. Such a service orientation 
has also been recently reaffirmed for the next decade in the Service Computing
manifesto [6], which calls for the widespread adoption of service computing.


2.1 Introduction to NFV 


The basic concept of NFV is to apply Cloud computing technologies to realize
telecommunication applications. NFV revolves around the concept of
virtualization, which makes it possible to run multiple systems in isolation on a
single hardware system. The exploitation of virtualization allows network functions
to be decoupled from the related (dedicated) hardware [17]. In other words, a
software implementation of different network functions (e.g., modulation, coding,
multiple access, firewall, deep packet inspection, evolved packet core components)
can be deployed on top of a so-called hypervisor, which runs on commercial
off-the-shelf servers instead of dedicated hardware equipment. The hypervisor
provides virtualization and resource management (e.g., scheduling access to CPU,
memory, and disk for the network functions). In addition, an orchestration
framework needs to be in place, so as to combine different virtual functions into
higher-layer service chains implementing the end-to-end service. Moreover, the
orchestration framework manages the deployment (e.g., which virtual function to
place on which physical server) and the life cycle of the virtual network functions,
including the management of their scalability. The latter comprises several tasks,
among which are monitoring performance and scaling resources either vertically or
horizontally (i.e., either acquiring more powerful computing resources or spawning
more replicas of the same virtual network function and load balancing among them).
Consequently, Virtual Network Functions (VNFs) differ from classical
server virtualization technologies, because VNFs may form service chains
composed of multiple virtual network functions that exchange traffic and that may
be deployed on one or multiple virtual machines running different network
functions, thus replacing a variety of hardware appliances [33]. Such a software
implementation of network functions is easily portable among different vendors
and may coexist with hardware-based platforms. Thus, the main benefits
provided are a reduction of capital and operational expenditures, a reduced
time-to-market, and scalability to different resource demands.

However, with the introduction of VNFs, additional problems may arise, such 
as increased complexity. Additional interfaces need to be defined and maintained 
(e.g., between the hypervisor and the orchestration system), which leads to more 
complex system design. In addition, as applications can have strict requirements
in terms of latency, performance guarantees are more difficult to satisfy.
This is because a given implementation of a VNF may perform differently when
deployed on different hardware. For example, the deployment of an I/O-intensive
VNF (e.g., a home subscriber service) on a server equipped with a standard HDD
may lead to lower performance than the one resulting from a deployment on a
server equipped with an SSD or NV-RAM. Consequently, new benchmarking
tools are required that allow correlating the performance of a given VNF when 
deployed on a given hardware with a certain configuration. 


2.2 Introduction to Service Composition Using SDN 


The second enabling technology is SDN, which separates the network control 
plane from the infrastructure (data) plane [31]. It involves logical centralization 
of network intelligence and introduces abstraction of physical networks from 
the applications and services via standardized interfaces. SDN is considered
an enabling technology for high volumes of traffic flows and responds at
runtime to dynamic demand for network resources, avoiding time-consuming and
costly manual reconfiguration of the network. Thus, it increases network resource
exploitation and decreases time to market. Furthermore, service-orientation is 
introduced to enable the runtime discovery and deployment of services. When 
combined with NFV and SDN technologies, this feature can significantly improve 
the efficiency of network operations. 

Figure 1 presents a high-level architecture, emphasizing three distinct man- 
agement layers that are coordinated by a vertical deployment manager to provide 
possibly coordinated QoS-aware decisions about service deployment. 

At the infrastructure layer, routers and switches are distributed over the 
network topology. These devices have their logical representation that is used 
for control and management purposes. Decisions of centralized network control 
are transferred over the standardized physical interfaces to operate over devices 
in this layer. 

Network resources are virtualized in the virtualization layer. Each virtual 
resource has its logical representation that enables efficient management. The
virtual resources may be interconnected into a graph-like topology. Again,
autonomous decisions about their interconnection and placement are the
responsibility of the management entity at this layer.

Fig. 1. High-level network overview.


Finally, at the application layer a number of basic component services are 
available in distributed data centers and exposed in service libraries. Complex
services may be composed of many basic services that are accessible through
service registries and can be composed on the basis of different goals. In the three
use cases that we present later in this chapter, we consider network service chains,
Web service and eScience workflows, and data stream processing (DSP)
applications. A network service chain allows assembling services out of multiple service
functions typically using basic patterns for service composition, e.g., a sequence 
of VNFs, with one or multiple instances needed for each VNF. Web services and 
eScience workflows usually organize their component services using more com- 
plex workflow patterns, e.g., conditional choice, loops, and fork-and-join. Finally, 
a DSP application is represented as a directed acyclic graph, that can be seen 
as a workflow diagram. 
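
As a small illustration of these composition models, the Python sketch below represents both a network service chain (a simple sequence of VNFs) and a DSP application (a directed acyclic graph) as edge lists and derives a valid execution order with Kahn's algorithm. The function name and the example service names are hypothetical and serve only to illustrate the data structure.

```python
from collections import deque


def topological_order(edges):
    """Return one valid execution order of a composition given as
    (upstream, downstream) edges; raise ValueError if it has a cycle."""
    successors = {}
    indegree = {}
    for upstream, downstream in edges:
        successors.setdefault(upstream, []).append(downstream)
        successors.setdefault(downstream, [])
        indegree[downstream] = indegree.get(downstream, 0) + 1
        indegree.setdefault(upstream, 0)
    # Start from the components without predecessors (the sources).
    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(indegree):
        raise ValueError("the composition contains a cycle")
    return order


# A service chain is a sequence; a DSP application is a general DAG.
service_chain = [("firewall", "dpi"), ("dpi", "nat")]
dsp_application = [("source", "filter"), ("source", "parse"),
                   ("filter", "join"), ("parse", "join"), ("join", "sink")]
```

Workflow patterns such as loops or conditional choices, which Web service and eScience workflows may use, do not fit this acyclic representation and would need a richer model.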

A service composition deployment on top of SDN allows cross-layer
optimizations, as the services interact with the SDN controller through its northbound
Application Programming Interface (API), using protocols such as REpresentational
State Transfer (REST) [39], Advanced Message Queuing Protocol (AMQP) [42],
and Message Queue Telemetry Transport (MQTT) [32]. In turn,
the SDN controller orchestrates the data center network that the
services are deployed on, through southbound API protocols such as
OpenFlow [28] and NetConf [16]. Such a cross-layer optimization supported by SDN
allows QoS guarantees at the service and network levels.

NFV and SDN do not rely on each other. NFV provides a flexible
infrastructure on which SDN software can run, and SDN can provide flow-based
configuration of network functions. Both technologies, when used in cooperation,
can offer enhanced QoS guarantees. In such a new network architecture, the network
logic is abstracted on several layers of abstraction. The management decisions of
each layer may affect the QoS provided by the network. Thus, the
selection of collected management decisions within the deployment manager should
balance between the flexibility provided at each level of network abstraction and
optimal QoS.

An ongoing standardization endeavor is Next Generation Service Overlay 
Network (NGSON), aiming to establish a collaborative framework among the 
stakeholders from various networks and technology paradigms in order to unify 
their vision on a common service delivery platform. Thus, the end-user need for
complex service delivery across network borders would be satisfied. The
standard aims to identify self-organizing management capabilities of NGSON 
including self-configuration, self-recovery, and self-optimization of NGSON nodes 
and functional entities. 


2.3 Overview of Use Cases 


In this chapter, we will look into three illustrative use cases of integrating SDN 
and NFV with QoS-aware service composition. 

Section 3 presents an overview of green strategies for VNF embedding,
supported by SDN and NFV. Here, the key idea is to manage the NFV
infrastructure, namely the composition of compute and networking resources
including servers and networking equipment, in an energy-efficient way. By powering
down unused servers and switches, the total energy consumption of the
infrastructure can be minimized. Important questions to ask are then: what is the
minimum number of servers, switches, and links necessary to provide the SLA
desired for the service chains that need to be embedded into the physical
network and compute infrastructure? Where should the functions be placed, and how
should the service chain traffic be routed, in order to find a balance between
energy efficiency, performance, and SLA?
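
To give a flavor of the consolidation question, the following Python sketch packs VNFs onto as few identical servers as possible with the classical first-fit decreasing heuristic, so that the remaining servers could be powered down. This is a deliberately simplified stand-in and not the strategy discussed in Sect. 3: it considers a single resource dimension, assumes identical servers, and ignores switches, links, traffic routing, and SLA constraints. The function and VNF names are hypothetical.

```python
def first_fit_decreasing(vnf_demands, server_capacity):
    """Assign VNFs to servers, opening a new server only when needed.

    vnf_demands maps a VNF name to its resource demand (e.g., CPU cores).
    Returns a list of servers, each given as the list of hosted VNFs.
    """
    servers = []  # each entry: {"free": remaining capacity, "vnfs": [...]}
    # Place the most demanding VNFs first (ties broken by name).
    ordered = sorted(vnf_demands.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, demand in ordered:
        if demand > server_capacity:
            raise ValueError(f"{name} does not fit on any server")
        for server in servers:
            if server["free"] >= demand:
                server["free"] -= demand
                server["vnfs"].append(name)
                break
        else:
            # No powered-on server can host the VNF: switch on a new one.
            servers.append({"free": server_capacity - demand,
                            "vnfs": [name]})
    return [server["vnfs"] for server in servers]
```

For instance, five VNFs with demands 6, 5, 4, 3, and 2 fit on two servers of capacity 10. A realistic embedding would additionally have to check that the resulting placement still satisfies the latency and throughput requirements of each service chain.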

Section 4 presents how the integration of an SDN controller with a DSP
framework allows the network paths to be adjusted according to per-application
needs in the QoS-aware deployment of DSP applications on the computing and network
resources. In the proposed integrated framework, SDN is used to expose to the 
DSP framework the network topology and network-related QoS metrics. Such 
information is exploited in a general formulation of the optimal placement prob- 
lem for DSP applications, which jointly addresses the selection of computing 
nodes and of network paths between each pair of selected computing nodes. 

We define services that access, process, and manage Big Data as big services. 
They pose computation and communication challenges due to the complexity,
volume, variety, and velocity of the Big Data they deal with. Moreover, they
are often deadline-bound and mission-critical. Each big service is composed
of multiple services so that it can be executed at Internet scale on
distributed clouds. Such a componentization of a big service improves its resilience
and latency-awareness. For example, consider a big service for weather fore- 
cast. It consists of various services including sensor data retrieval, data analysis 
services, and prediction. These component services are inherently distributed, 
including the ones that manage the actuators and the sensors on land, at sea, and on
satellites. By leveraging the SDN and NFV paradigms, SDSC ensures an effi- 
cient service composition from the replicated and globally distributed services. 
Section 5 discusses how SDSC leverages SDN to build and efficiently execute 
complex scientific workflows and business processes as service compositions. 


3 Green Strategies for VNF Embedding 


Next generation 5G networks will rely on distributed virtualized datacenters to 
host virtualized network functions on commodity servers. Such NFV will lead to 
significant savings in terms of infrastructure cost and reduced management com- 
plexity. Virtualization inside modern datacenters is a key enabler for resources 
consolidation, leading towards green strategies to manage both compute and 
network infrastructures where VNFs are hosted. However, green strategies for
networking and computing inside data centers, such as server consolidation or
energy-aware flow routing, should not negatively impact the quality and
service level agreements that network operators expect, provided that enough
resources exist. For example, consider two resource allocation strategies,
one focusing on performance and the other on energy efficiency. Both may lead
to a resource allocation that satisfies user demands and SLAs, but the green
strategy does so while minimizing the energy consumption. Once
fewer resources are available than requested, green strategies should guide the
resource allocation processes towards operational points that are more energy
friendly.

Important tools available for Cloud Operators are server consolidation strate- 
gies that migrate Virtual Machines (VMs) towards the fewest number of servers 
and power down unused ones to save energy. As VNFs are composed of a set of 
VNF Components (VNFC) that need to exchange data over the network under 
capacity and latency constraints, the networking also plays an important part. 
By using SDN, one can dynamically adjust the network topology and available 
capacity by powering down unused switch ports or routers that are not needed to 
carry a certain traffic volume [19], thus consuming the least amount of energy at
the potential expense of higher latency. Green strategies try to place the VNFCs
onto the fewest servers and to adjust the network topology and capacity to
match the demands of the VNFCs while consuming the least amount of energy
for operating the VNF Infrastructure. Such design of the VNF placement and
virtual network embedding can be formulated as a mathematical optimization 
problem, and efficient heuristics can be designed to quickly solve the problem. 
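
As a rough illustration of the kind of heuristic mentioned above, the following sketch packs VNFCs onto as few servers as possible with a first-fit-decreasing rule. This is a textbook bin-packing heuristic swapped in for illustration, not the method of the cited works. It considers CPU demand only and ignores the latency and bandwidth constraints; the function name and all demand and capacity values are hypothetical.

```python
def first_fit_decreasing(demands, capacities):
    """Assign each VNFC to the first server with enough residual CPU.

    demands:    dict VNFC name -> CPU demand (hypothetical units)
    capacities: dict server name -> CPU capacity (same units)
    Returns a dict VNFC -> server; raises ValueError if a VNFC does not fit.
    """
    residual = dict(capacities)
    placement = {}
    # Place the largest demands first, so big components are not stranded.
    for vnfc, demand in sorted(demands.items(), key=lambda kv: -kv[1]):
        for server in capacities:            # servers are tried in a fixed order
            if residual[server] >= demand:
                residual[server] -= demand
                placement[vnfc] = server
                break
        else:
            raise ValueError(f"no server can host {vnfc}")
    return placement


# Hypothetical instance: three VNFCs, three identical servers.
demands = {"v1": 4, "v2": 3, "v3": 2}
capacities = {"s1": 8, "s2": 8, "s3": 8}

placement = first_fit_decreasing(demands, capacities)
used = set(placement.values())
idle = [s for s in capacities if s not in used]   # candidates for power-down
print(placement, idle)
```

In this instance the heuristic leaves one server without any VNFC, which a green strategy could then power down. A complete heuristic would also have to check the latency and bandwidth budgets of each service chain.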

We can consider the Virtualized Compute and Network Infrastructure as
the set of hardware resources (comprising the compute and network
infrastructure) that hosts a certain number of VNFs inside a virtualized
data center. The virtualized data center can be geo-distributed to serve
users at different locations at the lowest cost in terms of energy, network, etc.
We assume that each VNF is made of a set of service chains, each of which is a
group of VNFCs with a set of traffic demands and a maximum tolerable latency
allocated to it. More precisely, the traffic demands specify, for each pair of
adjacent services in a chain, how much traffic the first sends to the second.
A service needs resources, e.g., in terms of CPU, memory, disk, and so on, to
process packets and then forward the processing results to the next component
of the chain.

The latency of a service chain is the sum of the delays experienced on the
paths over which all the demands of the service chain are forwarded. It
also includes the host-internal processing latency, which may differ
between architectural setups. For example, the standard Linux networking
approach leads to much higher latency and less available capacity than
the recently developed approaches for user-mode packet forwarding and
processing based on proprietary techniques, such as Intel's Data Plane
Development Kit (DPDK).¹ Similarly, Single Root Input/Output Virtualization
(SR-IOV²) is an extension to the PCI-express standard that allows different virtual
machines (VMs) hosting the VNFs in a virtual environment to share a single
network card over fast PCI-express lanes. Consequently, the additional latency
for VNF packet processing depends on the virtualization technology used in the
servers, which may be different for different server types. In addition, when two
VNFCs are placed on the same server, there is also a non-negligible overhead
when forwarding the packets from one component to another (after proper
processing), and this overhead (and thus the additional latency and capacity limits)
also depends on the virtualization technology used.
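
The latency definition above can be sketched as a small calculation: the chain latency is the sum of the link delays on the used paths plus a host-internal processing term per VNFC. The per-technology processing latencies and link delays below are hypothetical placeholders chosen for illustration, not measured values.

```python
# Hypothetical per-VNFC processing latency (microseconds) for two host setups.
PROCESSING_US = {"linux": 100, "dpdk": 20}


def chain_latency_us(link_delays_us, vnfc_hosts):
    """Sum the link delays on the used paths plus host-internal processing latency.

    link_delays_us: delays of all links traversed by the chain's demands
    vnfc_hosts:     packet-processing setup of the host of each VNFC in the chain
    """
    network = sum(link_delays_us)
    processing = sum(PROCESSING_US[tech] for tech in vnfc_hosts)
    return network + processing


def meets_budget(link_delays_us, vnfc_hosts, budget_us):
    """Check the chain against its maximum tolerable latency."""
    return chain_latency_us(link_delays_us, vnfc_hosts) <= budget_us


links = [200, 150, 300]                              # three traversed links
print(chain_latency_us(links, ["dpdk", "dpdk"]))     # 690
print(chain_latency_us(links, ["linux", "linux"]))   # 850
print(meets_budget(links, ["linux", "linux"], 800))  # False
```

With these assumed numbers, the same placement meets an 800 microsecond budget on one host setup and violates it on the other, which is why the virtualization technology has to be part of the model.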

In the following, we assume that we have available a set $J$ of servers and
a network graph $G(N, E)$, where $N$ represents the set of network nodes and $E$
denotes the links among them. Given the family of service chains, each of which
is defined as a specific number of traffic demands between pairs of a subset
$V^c \subset V$ of all VNFCs, the objective of the problem is to allocate all the
VNFCs on the servers and to find the network routes that satisfy the traffic
demands while minimizing the overall power consumption $P_{VNI}$ of the Virtual
Network Infrastructure, which is the sum of the power consumption of the
compute ($P_{servers}$) and network infrastructure ($P_{switches}$), given the
latency, resource, and bandwidth capacity budgets:

$$\min f = P_{VNI} = P_{servers} + P_{switches} \qquad (1)$$

The key idea for developing green strategies is to place the network functions on
the minimum number of servers and use the minimum number of highly energy
efficient network nodes that can serve the required capacity. Consequently, all
unused servers and switches can be powered down to reduce energy consumption.

1 https://software.intel.com/en-us/networking/dpdk.
2 https://www.intel.com/content/dam/doc/white-paper/pci-sig-single-root-io-virtualization-support-in-virtualization-technology-for-connectivity-paper.pdf.


3.1 Power Model Examples for Compute and Network 
Infrastructure 


Several power models have been proposed for the compute infrastructure. Typ- 
ically, they assume that the CPU of a server is the most power hungry compo- 
nent [35], and consequently most models just consider the power consumption 
due to CPU load. In general, the relationship between server power consumption 
and CPU utilization is linear [24,36] with some small deviations that are due to 
processor architecture, CPU cache related aspects and compiler optimizations 
leading to a different CPU execution. For performance modeling of green server
and network consolidation strategies, we can simplify by assuming that each
server $j$ has a unique idle power consumption $P_{idle,j}$, which denotes the
power required by the server when it is just powered on and does not run any
compute load (except the basic Operating System and management services). The
maximum power consumption $P_{max,j}$ denotes the power consumed by the given
server when all the CPU cores are under full load. In between the two extreme
cases, the power consumption follows a linear model dependent on the CPU
utilization.
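
The linear model just described can be written down in a few lines. The sketch below interpolates between the idle and maximum power of a server according to its CPU utilization and returns zero for a server that is powered down. The function name and the example wattages are assumptions made for illustration.

```python
def server_power(p_idle, p_max, utilization, powered_on=True):
    """Linear server power model.

    p_idle:      power (W) when the server is on but runs no compute load
    p_max:       power (W) when all CPU cores are under full load
    utilization: CPU utilization in [0, 1]
    """
    if not powered_on:
        return 0.0                      # a consolidated-away server draws nothing
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return p_idle + (p_max - p_idle) * utilization


# Hypothetical server: 100 W idle, 200 W at full load.
print(server_power(100.0, 200.0, 0.0))   # 100.0
print(server_power(100.0, 200.0, 0.5))   # 150.0
print(server_power(100.0, 200.0, 1.0))   # 200.0
```

Because the idle term is paid as soon as a server is on, two half-loaded servers of this kind consume 300 W, whereas one fully loaded server consumes 200 W. This is the saving that server consolidation targets.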

The network related power consumption can also be simplified to make it 
tractable in numerical models. For example, the work in [5] assumes that for 
network switches there are two main components that impact the total power 
consumption. A static and constant power is required to power the chassis and 
the line cards, which is independent of the traffic that the switch serves and 
the number of ports used. In addition, there is a dynamic power consumption
that depends on how many ports are powered on, on the link speed each port is
using (e.g., 1 Gbps or 10 Gbps), and on the dynamic utilization of the ports.
The power consumption also depends on the
switch manufacturer: the work by Heller et al. [19] provides an overview of
the power consumption of three different 48-port switch models. For example,
one switch has a power consumption of 151 W when the switch is idle and all
the ports are powered down, while it increases to 184 W when all the ports are
enabled and to 195 W when all the ports serve traffic at 1 Gbps. As one can see,
just powering on a switch accounts for the largest share of the power, powering
on additional ports adds little to the total, and the traffic-dependent power
consumption is almost negligible. Consequently, many
green strategies try to conserve energy by powering down unused switches and
unused ports.
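
The three measurements quoted above (151 W, 184 W, and 195 W for a 48-port switch) can be turned into a simple additive model. The sketch below assumes, purely for illustration, that every port contributes the same amount when enabled and the same additional amount when carrying 1 Gbps of traffic. This even split is an assumption and is not stated in the cited measurements.

```python
PORTS = 48
CHASSIS_W = 151.0                                # idle, all ports powered down
PER_PORT_W = (184.0 - 151.0) / PORTS             # assumed equal share per enabled port
PER_PORT_TRAFFIC_W = (195.0 - 184.0) / PORTS     # assumed extra per port at 1 Gbps


def switch_power(active_ports, loaded_ports=0, powered_on=True):
    """Static chassis power plus per-port and per-loaded-port contributions (W)."""
    if not powered_on:
        return 0.0
    if not 0 <= loaded_ports <= active_ports <= PORTS:
        raise ValueError("need 0 <= loaded_ports <= active_ports <= PORTS")
    return (CHASSIS_W
            + active_ports * PER_PORT_W
            + loaded_ports * PER_PORT_TRAFFIC_W)


print(switch_power(0))           # chassis only
print(switch_power(48))          # all ports enabled, no traffic
print(switch_power(48, 48))      # all ports serving 1 Gbps
print(CHASSIS_W / switch_power(48, 48))   # share of the static part at full load
```

Under this model the static part accounts for roughly three quarters of the power drawn at full load, which is why switching off whole devices saves far more than switching off individual ports.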


3.2 Illustrative Example 


In this section, we provide a simple example to illustrate the problem in Fig. 2.
We assume there are seven servers (labeled from $s_1$ to $s_7$), each one with its
own dedicated power profile specified by a given idle power $P_i^{idle}$ and
maximum power consumption $P_i^{max}$. Each server has limited resources in terms
of, e.g., CPU, memory, and disk capacities. To be more specific, server $s_i$ has
available $a_{1i}$ CPU, $a_{2i}$ RAM, and $a_{3i}$ disk. Each server is connected to a
specific router (e.g., the Top of Rack Switch in case of a Data Center).

Fig. 2. The joint VNF placement and network embedding problem [26].

Each link that connects the servers to the switch or the switches with each
other has a dedicated capacity and latency. In the example, the latency for
the link between $n_1$ and $n_2$ is denoted as $l_{12}$. The latency typically has
several components. The first one is the latency due to the capacity at which
the link operates, which is constant. There is also latency due to the
virtualization technique applied, which depends on the load of the servers and
other configurations (e.g., CPU cache misses). Furthermore, there is a
load-dependent latency due to queuing, which is typically non-linear. However,
under low load, such latency can be assumed to be linearly increasing, while
under higher load, we can use a piecewise approximation to model the latency due
to traffic being routed over the interface. In addition, each link has a
dedicated capacity (omitted from Fig. 2 due to complexity).
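
A piecewise approximation of the load-dependent queuing latency, as mentioned above, can be sketched as linear interpolation between a few breakpoints. The breakpoints below are hypothetical values chosen to mimic a curve that is almost linear at low load and steep near saturation; they are not taken from the chapter or its references.

```python
from bisect import bisect_right


def piecewise_latency(load, breakpoints):
    """Interpolate latency linearly between (load, latency) breakpoints.

    breakpoints: list of (load, latency_ms) pairs sorted by increasing load
    """
    loads = [point[0] for point in breakpoints]
    if not loads[0] <= load <= loads[-1]:
        raise ValueError("load outside the modeled range")
    # Index of the right end of the segment containing `load`.
    k = min(bisect_right(loads, load), len(loads) - 1)
    (x0, y0), (x1, y1) = breakpoints[k - 1], breakpoints[k]
    return y0 + (y1 - y0) * (load - x0) / (x1 - x0)


# Hypothetical curve: gentle slope up to 50% load, steep beyond 80%.
CURVE = [(0.0, 1.0), (0.5, 2.0), (0.8, 5.0), (0.95, 20.0)]

for load in (0.25, 0.65, 0.9):
    print(load, piecewise_latency(load, CURVE))
```

Each segment is linear, so a latency model of this form can be embedded in a mixed-integer linear program, whereas the exact non-linear queuing curve cannot.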

In the given example, we should embed into this NFV Infrastructure three
service chains ($c_1$, $c_2$, and $c_3$). Each service chain has its unique latency
bound, a dedicated traffic source ($S_1$, $S_2$, and $S_3$), and a sink ($D_1$, $D_2$,
and $D_3$). For example, in 5G, low latency should be enforced for
machine-to-machine traffic, while latency bounds could be more relaxed for
multimedia traffic. The model can also be specified flexibly to cover control
plane related service chains, with more stringent delay requirements. In the
example, we have three different VNFCs ($v_1$, $v_2$, and $v_3$) and we assume that
the traffic source for $c_1$ is the Sender $S_1$,
which is connected to router $n_2$ and injects a certain volume of traffic into the
service chain towards $v_1$. Then, $v_1$ processes the packets (for which it needs
resources such as CPU, memory, and disk) and forwards the processed traffic
(which may have a different volume than the one injected) towards VNFC $v_2$,
which again processes it and forwards a certain volume to the destination $D_1$
that is connected to router ng. 

Note that Fig. 2 assumes additional source/sink nodes where traffic for a
service chain is created/terminated. The figure shows an example of joint VNF
placement and network embedding into the physical substrate network. VNFC
$v_1$ would be placed onto server $s_3$, $v_2$ onto server $s_4$, and so on. Servers
hosting no VNFC would be powered down ($s_1$, $s_2$, $s_5$, $s_7$) together with all
the nodes not carrying any traffic ($n_1$).


4 Integrating SDN into the Optimal Deployment of DSP 
Applications 


In this section, we present a use case of integrating SDN with QoS-aware service
composition that focuses on Data Stream Processing (DSP) applications. The
advent of the Big Data era and the diffusion of the Cloud computing paradigm
have renewed the interest in DSP applications, which can continuously collect
and process data generated by an increasing number of sensing devices, to
extract valuable information in a timely manner. This emerging scenario pushes
DSP systems to a whole new performance level. Strict QoS requirements, large
volumes of data, and high production rates exacerbate the need for an efficient
usage of the underlying infrastructure. The distinguishing feature of DSP
applications is their ability to process data on-the-fly (i.e., without storing
them), moving them from one operator to the next, before they reach the final
consumers of the information. A DSP application can be regarded as a composition
of services [1] with real-time processing issues to address. It is usually
modeled as a directed acyclic graph (DAG), where the vertices represent the
processing components (called application operators, e.g., correlation,
aggregation, or filtering) and the edges represent the logical links between
operators, through which the data streams flow.
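
To make the DAG model concrete, the sketch below represents a small, hypothetical DSP application as a list of streams (edges) between named operators (vertices) and derives an order in which data can flow through them. The operator names are illustrative only, and the standard-library `graphlib` module (Python 3.9 or later) is used for the topological sort.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical application: each pair is a data stream (upstream, downstream).
streams = [
    ("source", "filter"),
    ("filter", "aggregation"),
    ("filter", "correlation"),
    ("aggregation", "sink"),
    ("correlation", "sink"),
]

# TopologicalSorter expects, for each node, the set of its predecessors.
predecessors = {}
for upstream, downstream in streams:
    predecessors.setdefault(upstream, set())
    predecessors.setdefault(downstream, set()).add(upstream)

# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(predecessors).static_order())
position = {operator: k for k, operator in enumerate(order)}
print(order)
```

Every stream goes from an earlier to a later position in `order`, which is the property a scheduler relies on when it places operators one after another.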

To date, DSP applications are typically deployed on large-scale and central- 
ized (Cloud) data centers that are often distant from data sources [18]. However, 
as data increase in size, pushing them towards the Internet core could cause 
excessive stress on the network infrastructure and also introduce high delays. A 
solution to improve scalability and reduce network latency lies in taking advan- 
tage of the ever-increasing presence of near-edge/Fog computing resources [4] 
and decentralizing the DSP application, by moving the computation to the edges 
of the network close to data sources. Nevertheless, the use of a diffused infras- 
tructure poses new challenges that include network and system heterogeneity, 
geographic distribution as well as non-negligible network latencies among dis- 
tinct nodes processing different parts of a DSP application. In particular, this 
latter aspect could have a strong impact on DSP applications running in latency- 
sensitive domains.

Fig. 3. DSP framework with SDN controller integration.


To address these challenges, we have proposed the solution depicted in Fig. 3 
and named SDN-integrated DSP Framework (for short, SIDF), which combines 
and integrates a DSP application framework with an SDN controller. To this 
end, we have: 


– extended the architecture of Apache Storm, a well-known open-source DSP
framework, by designing, developing, and integrating a few key modules that
enable a distributed QoS-aware scheduler architected according to the MAPE
(Monitor, Analyze, Plan, and Execute) reference model for autonomic
systems [7,8];

– designed, developed, and implemented the controller logic for a standard SDN
controller and the associated API to provide network monitoring and
dedicated stream routing configuration in an SDN network.


The proposed solution represents a full vertical integration of the application 
and network layers. The resulting architecture is highly modular and capable of 
taking full advantage of the SDN paradigm in modeling and optimizing the
performance of Fog-based distributed DSP applications. In particular, SIDF enables
the cross-layer optimization of the Fog/Cloud and SDN layers, whereby the SDN 
layer exposes to the upper layer the network topology and QoS metrics. This 
allows the optimal deployment of DSP applications by exploiting full knowl- 
edge of the computational and network resources availability and status. In this 
setting, an optimal deployment algorithm determines not only the application 
components placement on the underlying infrastructure but also the network 
paths between them.

For the sake of comparison with a non-SDN based solution, the proposed
solution is backward compatible with legacy IP networks, in which network paths
are solely determined by the underlying routing protocol and cannot be adjusted
to per-application needs, leaving the DSP framework with no control over them. In
this setting, the DSP manager can at most monitor the network performance
between candidate endpoints (see, e.g., [13] for a scalable network monitoring
service) and determine operator placement on the underlying infrastructure by
taking into account the observed network delays.


4.1 The SIDF Architecture 


SIDF uses a layered architecture to combine a DSP framework with an SDN 
controller (Fig. 3). The layered architecture enforces separation of concerns and
yields a loosely coupled system. Each layer realizes new functionalities
on top of lower-level services and exposes them as a service to the higher layer. 
SIDF comprises three main layers: infrastructure layer, network control layer, 
and application layer. 

At the lowest level, the infrastructure layer and the network control layer rep- 
resent the classical SDN network. Specifically, the infrastructure layer comprises 
network equipment, such as SDN devices and legacy IP devices. The former
allow communication paths to be monitored and dedicated, whereas the latter only
expose paths as black boxes that result from their routing protocol.

The network control layer manages the heterogeneity of network devices
and controls their working conditions. SIDF includes a network controller that
realizes two functionalities: monitoring and QoS routing. The monitoring
components periodically observe the network so as to extract metrics of
interest; to limit the footprint of monitoring operations, we only retrieve
network delays among network devices and computing nodes. Observe that these
monitoring operations can be realized in an SDN controller assisted manner, as
proposed in [41], where the SDN controller periodically sends probes on links to
measure their transfer delays, or in a distributed manner, where neighboring
SDN devices autonomously compute latencies. As a result, the network control
layer can expose a view of the infrastructure as a connected graph (or network
graph), where network devices and computing nodes are interconnected by network
links; the latter are labeled with monitoring information (e.g., network latency).
Observe that, with legacy IP devices, the link between two network nodes
represents the logical connectivity resulting from the routing protocols. As
regards the QoS routing functionalities, the SDN controller allows installing
dedicated stream routing configurations in the underlying infrastructure.
Leveraging the exposed network graph, the application layer of SIDF can instruct
the network to route streams on specific paths, according to application needs.
For example, the application might require data to be routed using either a
best-effort path, the path that minimizes the number of hops, or the one that
minimizes the end-to-end delay between two computing nodes.
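
The difference between the minimum-hop and the minimum-delay path on a delay-labeled network graph can be illustrated with a plain Dijkstra search. This is a generic sketch and not the controller logic of SIDF; the three-node topology, the delay values, and the function name are hypothetical.

```python
import heapq


def best_path(graph, src, dst, weight):
    """Dijkstra search on a delay-labeled graph.

    graph:  dict node -> list of (neighbor, delay_ms)
    weight: "delay" to minimize end-to-end delay, "hops" to minimize hop count
    Returns (cost, path) or None if dst is unreachable.
    """
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, delay in graph.get(node, []):
            if neighbor not in visited:
                step = delay if weight == "delay" else 1
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return None


# Hypothetical topology: a direct but slow link a-c, and a faster detour via b.
graph = {
    "a": [("b", 10), ("c", 30)],
    "b": [("a", 10), ("c", 10)],
    "c": [("a", 30), ("b", 10)],
}

print(best_path(graph, "a", "c", "delay"))  # (20, ['a', 'b', 'c'])
print(best_path(graph, "a", "c", "hops"))   # (1, ['a', 'c'])
```

The two criteria select different paths on the same graph, which is why the application layer needs to state which one it requires when it asks the network control layer for a route.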

The application layer includes the DSP framework, which abstracts the
computing and network infrastructure and exposes to users simple APIs to execute
DSP applications. Many DSP frameworks have been developed so far. Nevertheless,
most of them have been designed to run in a clustered environment,
where network delays are (almost) zero [9]. Since in an infrastructure with
distributed computing resources (as in the Fog computing environment) network
delays cannot be neglected, SIDF includes a custom distributed DSP framework
that conveniently optimizes the execution of DSP applications. This framework,
named Distributed Storm [8], has been implemented as an extension of Apache
Storm [40], one of the most widely adopted open-source DSP frameworks. Distributed
Storm oversees the deployment of DSP applications, which can be reconfigured
at runtime so as to satisfy QoS requirements (e.g., maximum application response
time). To this end, the framework includes a few key modules that realize the
MAPE (Monitor, Analyze, Plan, and Execute) control cycle, which represents
the reference model for autonomic systems [7,8]. During the execution of the MAPE
phases, Distributed Storm cooperates with the other layers of SIDF so as to jointly
optimize the application deployment and the QoS-aware stream routing.
Specifically, during the Monitor phase, the framework retrieves the resource and
network conditions (e.g., utilization, delay) together with relevant application
metrics (e.g., response time). Network conditions are exposed by the network control
layer. During the Analyze phase, all the collected data are analyzed to determine
whether a reconfiguration of the application deployment should be planned. If
reconfiguring the application is worthwhile to improve performance (or, more
generally, to satisfy QoS requirements), in the Plan and Execute phases the framework
first plans and then executes the corresponding adaptation actions (e.g., relocate
the application operators, change the replication degree of operators). The Plan
phase solves the optimal deployment problem, whose general formulation is
presented in the next section. If a reconfiguration involves changing the stream
routing strategy, the Execute phase also interacts with the network control layer,
so as to enforce new forwarding rules.
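
The control flow of one MAPE cycle, as described above, can be sketched with four callbacks. This is a schematic outline only and does not reproduce the Distributed Storm implementation or its API; the function name, the metric names, and the threshold are hypothetical.

```python
def mape_iteration(monitor, analyze, plan, execute):
    """Run one Monitor-Analyze-Plan-Execute cycle and return the executed actions."""
    metrics = monitor()          # resource, network, and application metrics
    if not analyze(metrics):     # is a reconfiguration worthwhile?
        return []
    actions = plan(metrics)      # e.g., solve the deployment problem
    execute(actions)             # relocate operators, install forwarding rules
    return actions


MAX_RESPONSE_MS = 100            # hypothetical QoS requirement
executed = []                    # stands in for the real adaptation machinery

actions = mape_iteration(
    # Hypothetical observation: response time too high, r3 is the closer node.
    monitor=lambda: {"response_ms": 120, "delay_ms": {"r2": 25, "r3": 10}},
    analyze=lambda m: m["response_ms"] > MAX_RESPONSE_MS,
    plan=lambda m: [("relocate", "o2", min(m["delay_ms"], key=m["delay_ms"].get))],
    execute=executed.extend,
)
print(actions)
```

Keeping the four phases behind separate callbacks mirrors the modular design described in the text: the Plan callback can be replaced by an exact solver or a heuristic without touching the monitoring or execution code.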


4.2 DSP Deployment Problem 


We now illustrate the optimal deployment problem for DSP applications with 
QoS requirements. We provide a general formulation of the optimal placement 
problem for DSP applications which jointly addresses the operator placement 
and the data stream routing by modeling both the computational and networking 
resources. A detailed description of the system model can be found in [9]. 

For a DSP application, solving the deployment problem consists in determin- 
ing for each operator i: 


1. the operator placement, that is, the computational node on which to deploy
operator $i$;

2. the network paths that the data streams have to traverse from operator $i$
to each downstream operator $j$.

For the sake of simplicity, here we do not consider the operator replication
problem, that is, the determination of the number of parallel replicas of each
operator to deploy in order to sustain the expected application workload.
Nevertheless, the following arguments can easily be extended to the general case,
e.g., using the approach presented in [10].

A deployment strategy can be modeled by associating with each operator $i$ a
vector $x^i = (x^i_1, \dots, x^i_R)$, where $x^i_u = 1$, with $u \in \{1, \dots, R\}$
representing a computing resource, if operator $i$ is placed on node $u$, and 0
otherwise. Similarly, we associate with each stream $(i, j)$ from operator $i$ to
operator $j$ the vector $y^{(i,j)} = (y^{(i,j)}_1, \dots, y^{(i,j)}_\Pi)$, where
$y^{(i,j)}_\pi = 1$, with $\pi \in \{1, \dots, \Pi\}$ representing a network
path, if the data stream from operator $i$ to operator $j$ follows path $\pi$, and
0 otherwise.

The Operator Placement and Stream Routing (OPSR) problem takes the 
following general form: 


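$$
\begin{aligned}
\min \quad & F(x, y) \\
\text{subject to:} \quad & Q^\alpha(x, y) \le Q^\alpha_{max} \\
& Q^\beta(x, y) \ge Q^\beta_{min} \\
& x, y \in A
\end{aligned}
\qquad (2)
$$

where $x = (x^{i_1}, \dots, x^{i_n})$ is the vector of the operator deployment binary
variables and $y = (y^{(i_1,j_1)}, \dots, y^{(i_n,j_n)})$ is the vector of the network
path variables.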

Here, $F(x, y)$ is a suitable objective function to be optimized, which can
conveniently represent application QoS metrics (e.g., response time), system
and/or network related metrics (e.g., amount of resources, network traffic), or
a combination thereof. $Q^\alpha(x, y)$ and $Q^\beta(x, y)$ are, respectively, those
QoS attributes whose values are bounded by a maximum and a minimum, and
$x, y \in A$ is a set of functional constraints (e.g., this latter set includes the
constraint $\sum_u x^i_u = 1$, which requires that a correct placement deploys an
operator on one and only one computing node, and $\sum_\pi y^{(i,j)}_\pi = 1$, which
requires that, in a correct routing, a stream flows on a single path).

The formulation above represents the most general problem formulation 
whereby we jointly optimize the application deployment x, by placing the
operators on suitable nodes in the network, while at the same time determining the
network paths y to carry the stream between operators. 

Using standard arguments, see, e.g., [9] for a similar problem, it can be proved 
that the resulting OPSR problem is NP-hard. As a consequence, efficient heuris- 
tics are required to deal with large problem instances in practice. Nevertheless, 
the proposed formulation can supply useful information for designing heuristics
that not only reduce the resolution time but also guarantee provable approximation
bounds on the computed solution.

Fig. 4. SDN-supported placement of a DSP application.


4.3 Illustrative Example 


OPSR determines how the computing and network resources should be utilized 
so as to execute a DSP application with QoS requirements (e.g., response time, cost,
availability). We observe that the application performance depends not only on
computing resources, but also on network links that realize the communication
among the computing nodes. This is especially true in geo-distributed
environments (like Fog computing) and when Big Data have to be efficiently transmitted
and processed. The strength of OPSR is the ability to jointly optimize (i.e., in 
a single stage) the selection of computing nodes and of network paths between 
each pair of selected computing nodes. 

We exemplify the problem using Fig. 4. We consider a simple DSP application
that filters and forwards important events to a notification service within a
limited time interval (i.e., it has QoS requirements on response time). The
application comprises a pipeline of three operators: a data source $o_1$, a filter
$o_2$, and a connector to the external notification service $o_3$. For the execution,
OPSR has to identify computing and network resources from the available
infrastructure that, in our example, comprises 4 processing nodes ($r_i$ with
$i \in \{1, \dots, 4\}$), 7 network devices ($n_i$ with $i \in \{1, \dots, 7\}$), and
10 network links ($l_i$ with $i \in \{1, \dots, 10\}$; observe that the network is
not fully connected). To better show the problem at hand and reduce its
complexity, we assume that each computing node $r_i$ can host at most one operator
and that $o_1$ and $o_3$ have already been placed on $r_1$ and $r_4$, respectively.
Therefore, OPSR has to deploy only the filtering operator $o_2$, selecting between
two possible choices: $r_2$ and $r_3$.

Interestingly, the network control layer can expose different views of the 
network, so that the upper application layer can select the most suitable network 
characteristics for running its applications. In our example, we consider that the
network control layer exposes paths with different QoS attributes in terms of
communication latency and available bandwidth. 


– In case $o_2$ is deployed on $r_2$, OPSR has to further select the network paths
for streams $(o_1, o_2)$ and $(o_2, o_3)$, which should flow between $(r_1, r_2)$ and
$(r_2, r_4)$, respectively. For the first stream $(o_1, o_2)$, the network controller
exposes $\pi_1 = \{l_1, l_4\}$, with 10 ms latency and 100 Mb/s bandwidth, and
$\pi_2 = \{l_2, l_5\}$, with 25 ms latency and 1 Gb/s bandwidth. Similarly, for the
second stream $(o_2, o_3)$, the network controller exposes $\pi_3 = \{l_6, l_8\}$,
with 10 ms latency and 300 Mb/s bandwidth, and $\pi_4 = \{l_7, l_{10}\}$, with 15 ms
latency and 850 Mb/s bandwidth.

– In case $o_2$ is deployed on $r_3$, OPSR can determine the network paths for
streams $(o_1, o_2)$ and $(o_2, o_3)$, which should flow between $(r_1, r_3)$ and
$(r_3, r_4)$, respectively. For the first stream $(o_1, o_2)$, the network controller
exposes $\pi_5 = \{l_1, l_3\}$, with 10 ms latency and 100 Mb/s bandwidth, and
$\pi_6 = \{l_2, l_5, l_6\}$, with 30 ms latency and 600 Mb/s bandwidth. For the
second stream $(o_2, o_3)$, the network controller exposes $\pi_7 = \{l_8\}$, with
5 ms latency and 100 Mb/s bandwidth, and $\pi_8 = \{l_9, l_{10}\}$, with 15 ms
latency and 600 Mb/s bandwidth.


Any of these paths is made available upon request, because the SDN controller
has to allocate resources so as to guarantee that QoS performance does not
degrade over time (e.g., due to link over-utilization). Since selecting one path
or another deeply changes the application performance, OPSR picks the most
suitable one driven by the DSP application QoS requirements, which are
captured by the objective function $F(x, y)$. Our DSP application needs to forward
event notifications with bounds on delay; therefore, it prefers to transfer data
using the paths with minimum communication latency. Hence, OPSR maps $o_2$
on $r_3$ and selects the paths $\pi_5$ and $\pi_7$, which introduce a limited
communication latency of 15 ms. Observe that, if the DSP application aimed to
optimize the amount of available bandwidth (as in the case of media streaming
applications), OPSR would have mapped $o_2$ on $r_2$ and selected the paths $\pi_2$
and $\pi_4$, which provide a bandwidth of 1 Gb/s and 850 Mb/s, respectively.
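
Because the example is tiny, the two outcomes discussed above can be checked by exhaustive enumeration. The sketch below lists every combination of placement for the filter operator and path per stream, using the latency and bandwidth values quoted in the example. Two modeling choices are assumptions made here for illustration: end-to-end latency is taken as the sum over the two streams, and the bandwidth objective is taken as the bottleneck (minimum) over the two streams.

```python
from itertools import product

# Paths exposed by the network controller: name -> (latency in ms, bandwidth in Mb/s).
PATHS = {
    "r2": {"o1->o2": {"pi1": (10, 100), "pi2": (25, 1000)},
           "o2->o3": {"pi3": (10, 300), "pi4": (15, 850)}},
    "r3": {"o1->o2": {"pi5": (10, 100), "pi6": (30, 600)},
           "o2->o3": {"pi7": (5, 100), "pi8": (15, 600)}},
}


def candidates():
    """Yield (node for o2, first path, second path, latency, bottleneck bandwidth)."""
    for node, streams in PATHS.items():
        first, second = streams["o1->o2"], streams["o2->o3"]
        for p1, p2 in product(first, second):
            latency = first[p1][0] + second[p2][0]        # assumed additive
            bandwidth = min(first[p1][1], second[p2][1])  # assumed bottleneck
            yield node, p1, p2, latency, bandwidth


min_latency = min(candidates(), key=lambda c: c[3])
max_bandwidth = max(candidates(), key=lambda c: c[4])
print(min_latency)    # ('r3', 'pi5', 'pi7', 15, 100)
print(max_bandwidth)  # ('r2', 'pi2', 'pi4', 40, 850)
```

Even here there are eight candidate solutions for a single free operator. With more operators, nodes, and paths the number of combinations grows multiplicatively, so exhaustive search is only viable for toy instances like this one.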

Although this is a toy example, it gives a flavor of the potential arising
from the cooperation between SDN and distributed DSP applications. At the 
same time, the example shows the combinatorial nature of the OPSR problem, 
which calls for the development of new efficient heuristics. 


4.4 Related Work on Big Data and SDN 


With the renewed interest in DSP applications, in the last years many research
works have focused on the placement and runtime reconfiguration of DSP
applications (e.g., [2,9,10,25,45] and the works cited therein). However, some of
these works [2,45] consider only the deployment of the DSP application in a
clustered and locally distributed environment. Moreover, to the best of our
knowledge, none of them exploits the support for the flexible and fine-grained
programmable network control offered by SDN.

Enlarging the focus to Big Data applications, of which DSP applications
represent the real-time or near-real-time constituent, SDN is considered as a
promising paradigm that can help to address the issues that prevail in
such applications [11,37]. These issues comprise data processing and
resource allocation in locally and geographically distributed data centers,
including micro data centers in Fog and edge computing, data delivery to end users,
a joint optimization that addresses the tight coupling between data movement
and computation, and application scheduling and deployment.

So far, in the Big Data scenario, most works have leveraged SDN to opti- 
mize the communication-intensive phase of Hadoop MapReduce [15] by placing 
MapReduce tasks close to their data, thus reducing the amount of data that must 
be transferred and therefore the MapReduce job completion time [29, 38, 43, 44].
A first work that explores the tight integration of application and network con- 
trol utilizing SDN has been presented by Wang et al. [43], which explores the idea 
of application-aware networking through the design of an SDN controller using 
a cross-layer approach that configures the network based on MapReduce job 
dynamics at runtime. The Pythia system proposed by Neves et al. [29] employs 
communication intent prediction for Hadoop and uses this predictive knowledge 
to optimize at runtime the network resource allocation. The Pythia network 
scheduling component computes an optimized allocation of flows to network 
paths and, similarly to the QoS routing in our SIDF architecture, maps the log- 
ical flow allocation to the physical topology and installs the proper sequence of 
forwarding rules on the network switches. Xiong et al. propose Cormorant [44], 
which is a Hadoop-based query processing system built on top of SDN, where 
MapReduce optimizes task schedules based on the network state provided by 
SDN and SDN guarantees the exact schedule to be executed. Specifically, SDN is 
exploited to provide the current snapshot of the network status and to install the 
network path having the best available bandwidth. Their experimental results 
show a 14-38% improvement in query execution time over a traditional app- 
roach that optimizes task and flow scheduling without SDN collaboration. Qin 
et al. in [38] propose a heuristic bandwidth-aware task scheduler that combines 
Hadoop with the bandwidth control capability offered by SDN with the goal to 
minimize the completion time of MapReduce jobs. 

The integration of SDN into the control loop of self-adaptive applications has 
been studied by Beigi-Mohammadi et al. [3] with the goal of exploiting network 
programmability to meet application requirements. This is a new trend in the 
design of self-adaptive systems. We also explore it with the SIDF architecture: 
the integration of SDN allows us to adapt at runtime the stream routing so 
that the QoS requirements of the DSP application can still be guaranteed when 
network operating conditions change. Despite the appealing features of SDN, the
tight cooperation between adaptive systems and the SDN controller might easily
become a scalability bottleneck. Indeed, SDN controllers are often implemented
as a single centralized entity, whereas adaptive systems can span geographically
distributed infrastructures. Further research is needed to enable
the exploitation of SDN features in a scalable manner.


5 Context-Aware Composition of Big Services


Big services are typically composed of smaller web services or microservices, 
each with multiple alternative deployments to ensure performance, scalability, 
and fault-tolerance. Such service compositions enable the design and implemen- 
tation of complex business processes, eScience workflows, and Big data applica- 
tions, by aggregating the services. Services are often implemented using several 
approaches, languages, and frameworks still offering the same API, standardized 
as RESTful or Service Oriented Architecture (SOA) [30] web services. 

As the demand for QoS and data quality is on the rise, along with the ever- 
increasing scale of Big data, service compositions execute in computational nodes 
that are geographically distributed at Internet scale. SDN can be extended
and leveraged to manage the underlying network that interconnects the build- 
ing blocks of such complex workflows, to enhance the scalability and potential 
use cases in services computing. An integration of SDN and NFV into service 
composition facilitates efficient context-aware distribution of service execution 
closer to the data, minimizing latency and communication overhead. 


5.1 Software-Defined Service Composition (SDSC) 


SDSC is an approach to distributed and decentralized service composition
that leverages SDN for efficient service placement on the service nodes.
Following the SDSC approach, a typical eScience workflow is mapped onto a
geographically distributed service composition. SDSC exploits both the
data-as-a-service layer and the network layer for resource allocation. System
administrators can monitor the health of the service compositions through the
web service engines that host the services, by observing runtime parameters
such as the executed requests and the requests on the fly. The list
of multiple web service deployments can be retrieved from the web service
registry. In addition, SDSC leverages the global network knowledge of the
SDN controller to obtain network parameters such as bandwidth utilization and
to fine-tune the service placement, offering features such as congestion control
and load balancing, which can be better achieved in the network layer.

By separating the execution from the data plane of the overall system, SDSC
facilitates integration and interoperability of more diverse implementations and
adaptations of the services. A resilient execution of a service composition can
be guaranteed through the network management capabilities offered by SDN, by
finding the best alternative among the multiple service implementations and
deployments available for the service composition execution. SDSC thus
facilitates increased control over the underlying network, while supporting
execution on various traditional web service engines and distributed execution
frameworks.

The core of SDSC is the communication between inter-domain SDN controllers,
facilitated by various Message-Oriented Middleware (MOM) [12] protocols such as
AMQP and MQTT. The service requests are mapped to the network through SDN, and
the resource provisioning is managed with the assistance of the SDN controller.
Hence, each domain is aware of the service requests that are served by the
services it hosts. By offering communication between inter-domain controllers,
resources are allocated efficiently for each service request.

There is an increasing demand for configurability in service composition.
Context-aware service composition is enabled by exploiting SDN in deploying 
service compositions. The Next Generation Service Overlay Network (NGSON) 
specification offers context-aware service compositions by leveraging virtualiza- 
tion [20]. SDN and NFV support context-awareness and traffic engineering capa- 
bilities [34], to manage and compose services. Research efforts focus on efficient 
resource utilization as well as enabling pervasive services [23] motivated by the 
standardization effort of NGSON. 


5.2 Componentizing Data-Centric Big Services on the Internet 


Workflows of mission-critical applications rely on redundant links and on
alternative implementations and deployments, which exist either as a result of
parallel independent developments or because they were introduced deliberately
to handle failures, congestion, and overload in the nodes. Distributed cloud
computing and volunteer computing are two examples that permit multi-tenant,
computation-intensive complex workflows to be executed in parallel, leveraging
distributed resources.

Figure 5 represents a multi-tenant cloud environment with various tenants.
The tenants execute several big services. Many aspects such as locality of the 
executing cloud data center and policies must be considered for an efficient 
execution of the service workflow. An SDN controller deployment can ensure QoS 
to the cloud, by facilitating an efficient management of the network-as-a-service 
consisting of SDN switches, middleboxes, and hosts or servers. The controller 
communicates with the cloud applications through its northbound API, while 
controlling the SDN switches through its southbound API. Thus, SDN facilitates 
an efficient execution of big services. 

In practice, no complex big service is built and deployed as a singleton or a 
tightly coupled single cohesive unit. Mayan [21], which is a distributed execution 
model and framework for SDSC, defines the services that compose a big services 
workflow as the “building blocks” of the workflow. SDSC aims to extend the 
SDN-enabled service execution further to the Internet-scale. 


Representation of the Model. We need to consider and analyze the poten- 
tial execution alternatives of the services, to support a context-aware execution 
of service compositions. In this section, we formally model the big services as 
service compositions and consider the potential execution alternatives for their 
context-aware execution. Services are implemented by various developers follow- 
ing different programming languages and paradigms. 

$\forall n \in \mathbb{Z}^+;\ \forall \alpha \in \{A, B, \ldots, N\}$: $s^n_\alpha$ represents the $\alpha^{th}$ implementation of
service $s^n$.

Each implementation of a service can have multiple deployments, distributed
throughout the globe, either as replicated deployments or independent
deployments by different edge data centers. These multiple deployments
facilitate a bandwidth-efficient execution of the services.

Fig. 5. Network- and service-level views of a multi-tenant cloud.

$\forall m \in \mathbb{Z}^+$: $s^n_{\alpha m}$ represents the $m^{th}$ deployment of $s^n_\alpha$.

Each service can be considered a function of a varying number of input
parameters. Any given big service S can be represented as a composite function 
or a service composition. These service compositions are composed of a subset 
of globally available services. 

$\forall x \in \mathbb{Z}^+,\ x \le n;\quad S = s^1 \circ s^2 \circ \cdots \circ s^x$.

The minimum number of execution alternatives for any service can be
represented by $\kappa_s$, where:

$\forall s \in S: \kappa_s = \sum_{\alpha=A}^{N} m_\alpha$.

Here, $N$ different service implementations and a varying number $m_\alpha$ of
deployments for each implementation of $s$ are considered.
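
As a small worked instance of this definition, the following Python sketch sums
the number of deployments over the implementations of one service. The
implementation labels and deployment counts are hypothetical values chosen only
for illustration.

```python
def execution_alternatives(deployments_per_implementation):
    """Return kappa_s: the deployment counts summed over all implementations."""
    return sum(deployments_per_implementation.values())


# Hypothetical service with three implementations (A, B, C) and
# 2, 1, and 3 deployments respectively.
service_deployments = {"A": 2, "B": 1, "C": 3}
kappa = execution_alternatives(service_deployments)
print(kappa)  # 6
```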


Minimum and Maximum Execution Alternatives. Now we will formalize 
the maximum and minimum execution alternatives for any service composition, 
considering the multiple implementations or deployed replicas of the same ser- 
vice. More execution alternatives will offer more resilience and scalability to the 
service composition. 

$\eta_S$ represents the number of alternative execution paths for each big
service $S$. The service that has the minimum alternatives limits the minimum
number of potential alternatives for a service composition.

$\eta_S \ge \min(\kappa_x : x \le n) \ge 1$.

Taking into account the alternatives due to various service combinations in
the big service, the maximum number of alternatives is limited by the product
of the alternatives for each service.

$\eta_S \le \prod_{x \le n} \kappa_x$.

Hence,

$\min(\kappa_x : x \le n) \le \eta_S \le \prod_{x \le n} \kappa_x$.
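
The two bounds can be checked numerically with a short Python sketch. The list
of per-service alternatives used here is a hypothetical example, and the
function name is ours.

```python
from math import prod


def alternative_bounds(kappas):
    """Return (lower, upper) bounds on the number of execution paths.

    kappas holds the number of execution alternatives of each service
    in the composition; every service needs at least one alternative.
    """
    if not kappas or min(kappas) < 1:
        raise ValueError("every service needs at least one alternative")
    return min(kappas), prod(kappas)


# Hypothetical composition of three services with 6, 2, and 4 alternatives.
lower, upper = alternative_bounds([6, 2, 4])
print(lower, upper)  # 2 48
```

With these values the composition has at least 2 and at most 48 alternative
execution paths; the service with the fewest alternatives determines the lower
bound.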

Various protocols and web services standards unify the message passing 
between the services, and enable seamless migration among the alternatives, 
in a best-effort and best-fit strategy. SOA and RESTful web services support 
common message formats through standardizations. These efforts unify and rev- 
olutionize the way services are built on the Internet. 


5.3 Illustrative Example 


Figure 6 illustrates a sample workflow that represents a service composition. This 
workflow can be an eScience workflow or a complex business process. The work- 
flow represents multiple possible execution paths when the service composition 
is decomposed or componentized into services (Services 1, 2, ..., n). A, B, C, ..., Z
represent the alternative implementations for each of the services. Thus, service
implementations such as 1A, 1B, and 1Z can function as an alternative to each 
other (here, each of these is an implementation of service 1). 

As illustrated by Fig. 6, if service 3A is either congested or crashed, the
service execution can be migrated to the next best-fit deployment 3B (chosen
based on locality or some other policy). (2,3)Z represents a service that is
equivalent to the service composition 3A(2A), in which the output of 2A is the
input to 3A. Hence, it is not an alternative to 2A or 3A. It is also possible
that not all the services have alternative deployments in the considered
environments (as indicated by the absence of a deployment 2C for Service 2).
Service deployment details need to be specified in the service
registry to be able to compose and execute the service workflows seamlessly.
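
The migration to the next best-fit deployment can be sketched as follows. This
is a minimal illustration, not the selection logic of any specific framework:
the `distance` field is a hypothetical locality metric, and the health flag
stands in for the congestion and failure information that a registry or
controller would provide.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    label: str       # e.g., "3A" or "3B" for implementations of Service 3
    distance: float  # hypothetical locality metric; lower means closer
    healthy: bool = True


def best_fit(deployments):
    """Return the closest healthy deployment, or None if none is available."""
    healthy = [d for d in deployments if d.healthy]
    return min(healthy, key=lambda d: d.distance, default=None)


service3 = [Deployment("3A", 1.0), Deployment("3B", 2.5), Deployment("3Z", 4.0)]

first_choice = best_fit(service3)
service3[0].healthy = False  # 3A becomes congested or crashes
fallback = best_fit(service3)
print(first_choice.label, fallback.label)  # 3A 3B
```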


5.4 eScience Workflows as Service Compositions 


The Internet consists of various data-centric big services. Complex eScience 
workflows leverage multiple big services for their execution and can be decom- 
posed into various geo-distributed web services and microservices. eScience work- 
flows can, therefore, be represented by service compositions. Thus, these big 
services, centered around big data, can be expressed as simpler web services,
which can be executed in a distributed manner. 

Mayan seeks to find the best fit among the alternatives of available service 
execution options, considering various constraints of network and service level 
resource availability and requirements, while respecting the locality of the service 
requests. Mayan proposes a scalable and resilient execution approach to offer 
a multi-tenant distributed cloud computing platform to execute these services 
beyond data center scale. 

Mayan enables an adaptive execution of scientific workflows through federated
SDN controllers deployed in a wide area network. Hence, Mayan leverages
the various potential alternative execution paths existing between the service
compositions, while exploiting the network knowledge of the SDN controller.
Furthermore, Mayan utilizes the local workload information available at the
web service engine and web services registry. The information received from
this services layer includes web service requests on the fly and web services
served at any time by the web service deployment. As an implementation of an
SDSC, Mayan exploits both the control plane and services plane in offering a
load-balanced, scalable, and resilient execution of service compositions. Mayan
leverages OpenDaylight’s data tree as an efficient control plane data store while
using an AMQP-based messaging framework to communicate across multiple
network domains in service resource allocation.

Fig. 6. Simple representation of multiple alternative workflow executions.


5.5 Inter-domain SDN Deployments 


The SDN architecture needs to be extended for an Internet-wide service compo- 
sition. A global view of the entire network hierarchy may not even be feasible 
to achieve for a single central controller due to the organizational policies. An 
inter-domain SDN deployment is necessary to cater for this scale and segregation 
of the network. Here, each domain (that can represent a cloud, organization, or 
a data center) is orchestrated by an SDN controller cluster. 

The clustered deployment prevents the controller from becoming a single 
point of failure or a bottleneck. As eScience workflows are deployed on a global 
scale, a federated deployment of controller clusters is leveraged to enable com- 
munications between inter-domain controller clusters, without sharing a global 
network view. The federated deployment allows network level heuristics to be 
considered beyond data center scale, using MOM protocols in conjunction with 
SDN. Inter-domain controllers communicate through MOM messages between 
one another. Hence, SDN controllers of different domains have protected access 
to data orchestrated by one another, based on a subscription-based configuration 
rather than a static topology.

Some research work has previously leveraged federated SDN controller
deployments for various use cases. CHIEF [22] presents a scalable inter-domain
federated SDN controller deployment for wide area networks, as a “controller
farm”. It builds a large-scale community cloud orchestrated by various
independent controller clusters sharing data through a protected MOM API. Such a
controller farm may support collaboration between the networks of multiple
organizations, which would otherwise be unable to coordinate at the network
level. SDSC can be extended to support Service Function Chaining (SFC), that is,
the creation of an ordered sequence of middlebox actions or VNFs such as load
balancing and firewalling.
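
The subscription-based exchange between inter-domain controllers described in
this section can be illustrated with a toy publish/subscribe sketch. All names
below (`Broker`, `DomainController`, the topic string, the load value) are
hypothetical, and the in-memory broker is only a stand-in for a real MOM broker;
it does not use AMQP or MQTT.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory stand-in for a MOM broker (not AMQP or MQTT)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


class DomainController:
    """Controller of one domain; it only learns what it subscribed to."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.remote_state = {}  # data received from other domains

    def follow(self, topic):
        self.broker.subscribe(topic, self._on_message)

    def share(self, topic, payload):
        self.broker.publish(topic, {"domain": self.name, "payload": payload})

    def _on_message(self, message):
        self.remote_state[message["domain"]] = message["payload"]


broker = Broker()
a = DomainController("domain-a", broker)
b = DomainController("domain-b", broker)
c = DomainController("domain-c", broker)

b.follow("domain-a/load")                      # only b subscribes
a.share("domain-a/load", {"service-1": 0.7})   # a publishes its load
print(b.remote_state, c.remote_state)
```

Controller `b` receives the update because it subscribed to the topic, while `c`
learns nothing; no controller needs a global view of the network.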


6 Benefits and Open Issues 


Network virtualization and programmability of network resources enable the
dynamic creation of service chains that satisfy the QoS demands of complex
services at runtime. Runtime control of traffic and of the usage of network
resources is provided from the infrastructure to the control layer, thus
enabling runtime management decisions. Abstracting the network infrastructure
plane is a step similar to introducing higher levels of abstraction into
programming languages. The key benefit of such abstraction is that it enables
less experienced developers to program new applications more easily, using
abstract objects of network resources, with the help of formal programming
frameworks and environments. The risk of programmer faults is minimized through
formalisms implemented in programming languages. The main benefit lies in
relieving new application developers of the need for very complex network
skills, thus opening application development even to non-specialists and
innovation opportunities to the wider community. Abstraction of network
resources will also open innovation opportunities based on the use of an
unlimited pool of network resources.

A direct consequence of opening network resources to a wider developer
community is the acceleration of the process of offering new features to end
users and the minimization of development costs. Another result of abstraction
is the introduction of standard interfaces that enable each layer to evolve and
change independently. In contrast to traditional networks, where vendor lock-in
solutions dominate, the new network architecture introduces standard application
platform interfaces between network layers. The resulting independence from
provider equipment has opened numerous opportunities for innovation by using an
unlimited pool of network resources and services offered by various networks.

Furthermore, the programmable network enables numerous possibilities for 
network automation. New service management models may be developed at each 
network layer independently with runtime control of network resources. These 
may be used to autonomously control the efficiency of network resource use while
addressing specific QoS requirements of the particular application. 

Nowadays, service compositions and Big Data applications must deal with
changing environments and variable loads. Therefore, to guarantee acceptable
performance, these applications require frequent reconfigurations, such as
adjustments of application component placement or selection of new services. In
this respect, the SDN capability of programming the network at runtime allows a
cross-layer management of computational and networking resources, thus enabling
a joint optimization of application placement (or service composition) and data
stream routing. The cross-layer management can be especially beneficial in
geo-distributed environments, where network resources are heterogeneous, subject
to changing working conditions (e.g., congestion), and characterized by
non-negligible communication delays. In an SDN environment, the application
control layer (e.g., service composition broker, DSP framework) can regard the
network as a logical resource, which can be managed like a computing resource in
a virtualized computing environment. Specifically, programmability makes it
possible to automate and control the network so as to adjust its behavior to
fulfill the application needs. For example, multiple paths or paths with
specific QoS attributes can be reserved for transmitting data, data streams can
be redirected during application component downtime, or network devices can be
programmed to carry out new functions. Moreover, the use of standardized
interfaces between the application layer and the network controller (i.e.,
Northbound APIs) simplifies the implementation and utilization of new network
services (e.g., QoS-based routing).

With respect to the integration of SDN and Big Data, and specifically to
the SIDF architecture presented in Sect. 4, we observe that when the network
controller in SDN is used for Big Data applications, its performance could be
degraded by the rapid and frequent flow table update requests, which might
not be sustained by today’s SDN controllers. The problem is exacerbated if the
controller serves multiple applications/frameworks, as it can easily become the
performance bottleneck of the entire architecture. To this end, we need to
define solutions that cater for the presence of multiple applications with
possibly diverse and conflicting QoS requirements, by defining policies that
ensure fair usage of network resources in the face of competing resource
requests. The problem becomes relevant in large-scale distributed environments,
where a centralized approach might not scale and a distributed solution becomes
preferable.

New service development formalisms may be required to standardize processes
at the network management level. In the future use of such a programmable
network environment, the network is seen as an unlimited pool of resources. A
significant increase in network use is therefore expected, together with a
number of new and innovative services. Such an increase in the diversity of
network services, along with a number of new application interfaces, would
require service development and management models to be redefined. New design
principles would be needed, and this need would become more evident with
increased diversity at the network application layer. For such purposes, there
is a need for new developments in formal methods for introducing controlled
behavior in network programming. The development of network compilers is an
ongoing research activity for these purposes. Furthermore, new mathematical
models are needed that would be able to describe network behavior. There is a
need for generative models that can predict the parameters from the internal
properties of the processes we are controlling. Such models would not only bring
efficiency in processing network control algorithms, but would also be capable
of simulating phenomena in network behavior.


7 Conclusions


In this chapter, we looked into how SDN and NFV enable QoS-aware service 
compositions, and how SDN can be leveraged to facilitate cross-layer optimiza- 
tions between the various network and service layers. So far, SDN has been 
largely and separately exploited mainly in telecommunication environments. For 
example, NFV placement and SDN routing for network embedding have been 
used to achieve energy efficiency as explained in Sect. 3. However, there is an 
increasing interest in exploring the network control opportunities offered by 
SDN in the Big Data context, as discussed for the deployment of DSP appli- 
cations on the underlying computing and networking resources. In the use case 
presented in Sect. 4, SDN is used to expose to the service management layer the 
network topology and network-related QoS metrics. The service management 
layer determines both the application components placement on the underlying 
computing resources and the network paths between them. In this way, SDN 
allows autonomous adjustment of the network paths according to application needs.
Furthermore, in Sect. 5 we provided an example of using SDN for the design and 
implementation of complex scientific and business processes. 

Through these three examples, we presented different deployment manage- 
ment decisions for service compositions over the layers of a network architecture 
that integrates SDN and NFV. As a future research direction, we identify the need
for the development of an autonomous management framework that can coordi- 
nate cross-layer decisions taken by different management layers while deploying 
service compositions that satisfy QoS guarantees in an Internet-scale distributed 
network. Future work is also needed to investigate the side effects that may arise 
from the coordination among management decisions at different layers.


Abstract. Network Performance (NP)- and more recently Quality of
Service/Experience/anything (QoS/QoE/QoX)-based network management
techniques focus on the maximization of associated Key Performance
Indicators (KPIs). Such mechanisms are usually constrained by
certain thresholds of other system design parameters, e.g., typically,
cost. When applied to the current competitive heterogeneous Cloud
Services scenario, this approach may have become obsolete due to its static
nature. In fact, energy awareness and the capability of modern
technologies to deliver multimedia content at different possible combinations
of quality (and price) demand a complex optimization framework.

It is therefore necessary to define more flexible paradigms that make 
it possible to consider cost, energy and even other currently unforeseen 
design parameters not as simple constraints, but as tunable variables 
that play a role in the adaptation mechanisms. 

In this chapter we will briefly introduce the most commonly used
frameworks for multi-criteria optimization and evaluate them in different
Energy vs. QoX sample scenarios. Finally, the current status of related 
network management tools will be described, so as to identify possible 
application areas. 


1 Introduction 


Network Performance- and more recently Quality of Service/Experience/X-based
network management techniques (where “X” can represent “S” service,
“P” perception, “E” experience or “F” flow, just to give a few examples) focus on
the maximization of associated KPIs. Such mechanisms are usually constrained
by certain thresholds of other system design parameters, e.g., typically, cost.
When applied to the current competitive heterogeneous Internet of Services
scenario, this approach may have become obsolete due to its static nature. In
fact, energy awareness and the capability of modern technologies to deliver
multimedia content at different possible combinations of quality (and price)
demand a complex optimization framework.

© The Author(s) 2018

It is therefore necessary to define a more flexible paradigm that makes it 
possible to consider cost, energy and even other currently unforeseen design 
parameters not as simple constraints, but as tunable variables that play a role in 
the adaptation mechanisms. As a result, for example, the service supply will then 
search for the maximum QoE at the minimum cost and/or energy consumption. 
In consequence, a certain service will not be offered at a single and specific 
guaranteed price, but will vary with the objective of obtaining the best (QoE, 
cost, energy, etc.) combination at a given time. 

Unfortunately, most considered design parameters are conflicting, and there- 
fore the improvement of one of them entails some deterioration of the others. 
In these circumstances, it is necessary to find a trade-off solution that optimizes 
the antagonistic criteria in the most efficient way. Therefore, the resource
allocation problem becomes a multi-criteria optimization problem and the
relevance of each criterion gains the utmost importance.

This chapter analyzes the existing optimization frameworks and tools and
studies the complexity of introducing utility functions into network/management
mechanisms, including fairness considerations. Then, we present
cost/energy/*-aware network and cloud services management scenarios. Finally, we
address the challenge of introducing energy-awareness in network control
mechanisms and provide a general view of current technologies and solutions.


2 Dealing with Multi-criteria Optimization: Frameworks 
and Optimization Tools 


Regardless of the mathematical or heuristic tools applied in order to find (near)
optimal solutions in the scope of Internet of Services management mechanisms, 
all of them share common issues due to the extension of the original definition 
of the problem to a multi-criteria one. This section provides a summarized com- 
pilation of those issues, especially those related to how the decision maker (DM) 
will take into consideration different antagonistic criteria. 


2.1 Generic Definition of the Problem 


The classical constrained single-criterion problem deals with finding the
combination of design parameters (normally represented by a vector $x^*$) in the
feasible space ($S$) that minimizes a single function (1).


$\exists x^* \in S \;/\; \min f(x^*) = z$ (1)


Then the multi-criteria or multi-objective optimization problem, defined as 
an extension of the mono-criteria one, aims at simultaneously minimizing a col- 
lection of requirements keeping the equality and inequality constraints of the 
feasible space (2).


$\exists x^* \in S \;/\; \min f_i(x^*) = z_i \quad \forall i = 1, 2, \ldots, k$ (2)


The optimal solution that simultaneously minimizes all the criteria is rarely
achievable, and is known as the utopian solution [5]. Therefore, the
actual best solution of the problem should be as close as possible to this
utopian solution. The optimization problem must then be redefined to extract,
from the whole feasible space of solutions, those closest to the utopian
solution. That set of solutions characterizes the Pareto-optimal front. The goal
of a good multi-criteria optimization method is to find a set of solutions that
properly represents that Pareto front, i.e., one that is uniformly distributed
along it.
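
As a minimal sketch of how a Pareto front can be extracted from a finite set of
candidates, the following Python code keeps only the non-dominated points. It
assumes that every criterion is to be minimized; the (cost, energy) pairs are
hypothetical values used only for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every criterion and better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def pareto_front(points):
    """Return the non-dominated points (all criteria are minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (cost, energy) pairs of five candidate configurations.
solutions = [(1, 9), (3, 4), (4, 5), (6, 2), (7, 7)]
front = pareto_front(solutions)
print(front)  # [(1, 9), (3, 4), (6, 2)]
```

The points (4, 5) and (7, 7) are discarded because (3, 4) is better in both
criteria; the remaining points are mutually incomparable trade-offs.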

However, due to the trade-offs among different parameters, in most of the 
cases there will not exist such a solution which minimizes all the criteria simul- 
taneously. So, the nature of the problem is usually re-defined by introducing the 
concept of Utility Function, responsible for quantifying the relevance and com- 
posite articulation of different criteria. Then, the real formulation of the problem 
can be expressed mathematically as follows (3). 


$\exists x^* \in S \;/\; \min U(z_1, z_2, \ldots, z_k)$ (3)
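
One simple way to build such a utility function is a weighted sum of the
criteria. The sketch below assumes this particular form of U, which is only one
of several possible choices; the candidate values and the weights are
hypothetical.

```python
def weighted_sum(criteria, weights):
    """Scalar utility U(z_1, ..., z_k) as a weighted sum (to be minimized)."""
    if len(criteria) != len(weights):
        raise ValueError("one weight per criterion is required")
    return sum(w * z for w, z in zip(weights, criteria))


# Hypothetical (cost, energy) values of three candidate solutions.
candidates = [(1, 9), (3, 4), (6, 2)]

# Equal relevance of both criteria.
balanced = min(candidates, key=lambda z: weighted_sum(z, (0.5, 0.5)))

# Cost is considered much more relevant than energy.
cost_focused = min(candidates, key=lambda z: weighted_sum(z, (0.9, 0.1)))

print(balanced, cost_focused)  # (3, 4) (1, 9)
```

Changing the weights changes which candidate is selected, which shows how the
utility function encodes the relevance that the decision maker assigns to each
criterion.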


2.2 Incorporating Multiple Criteria in General Optimization 
Methods 


Multiple Objective Optimization (MOO) has been a field of intensive research
in different engineering areas. This activity has led to the development of
many MOO methods, ranging from exact methods to meta-heuristics and including
algorithms of very different natures.

In this section, we propose a comprehensive taxonomy of the optimization 
problem synthesized from the works in [13,14,21,30,31,36]. The presented tax- 
onomy categorizes the optimization problems according to different perspectives 
where the main goal is to determine how the multiple criteria are considered 
by the DM. Table 1 summarizes the characterization of the optimization criteria 
that are defined as follows: 


— Qualitative vs. quantitative criteria: refers to how the analyzed criteria 
are measured. If the DM is able to represent the preference degree of one 
option against the others by a numerical value, then the criteria are
quantitative. Otherwise, the criteria are qualitative, meaning that preference
cannot be numerically measured or compared and, in consequence, a descriptive
value is assigned. 

— Preference articulation: refers to the point in time at which the DM
establishes its preferences:

• A priori preference articulation: the preferences are defined at the
problem modeling stage, adding supplementary constraints to the problem
(e.g., weighted sum and lexicographic methods).

• A posteriori preference articulation: once the optimization process
has provided the set of results, the DM’s preferences are used to refine
the final solution (e.g., in evolutionary and genetic methods).

• Progressive preference articulation: the DM’s preferences are gradually
incorporated in an interactive way during the optimization process.

• Without preference articulation: when there is no preference definition
for the problem (e.g., max-min formulation, global criterion method).

— Continuous vs. discrete: refers to the variable type used to describe the
optimization criteria. When the optimization problem handles discrete
variables, such as integers, binary values or other abstract objects, the
objective of the problem is to select the optimum solution from a finite, but
usually huge, set. On the contrary, continuous optimization problems handle
variables that can take infinitely many values. In consequence, continuous
problems are usually easier to solve due to their predictability, because the
solution can be approached with an approximate iterative process. Since
cost/energy-aware network and services management must deal with both discrete
(e.g., number of servers, route lengths, radio bearers, etc.) and continuous
design parameters (e.g., coding bit rate, transmission power, etc.), both
techniques should be considered.

— Constrained vs. not constrained: refers to the possibility of attaching
a set of requirements, expressed through equalities or inequalities, to the
optimization problem. In this case, besides optimizing a collection of
criteria, the solution must also meet a set of constraints. Unconstrained
methods can be used to solve constrained problems by replacing the
restrictions with penalties on the objective functions that discourage
constraint violations. As mentioned above, classical network management
approaches involved considering a single criterion only and establishing Cost
and Energy constraints. The proposal in ACROSS to move to a multi-criteria
optimization analysis does not necessarily imply getting rid of all the
possible constraints.
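The idea of solving a constrained problem with an unconstrained method by penalizing violations can be sketched as follows. The toy objective, the constraint x <= 1, the penalty weight and the ternary-search solver are all assumptions chosen for illustration; they do not come from the ACROSS work.

```python
def ternary_search_min(f, lo, hi, iters=200):
    """Unconstrained 1-D minimizer for a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def objective(x):
    """Toy criterion to minimize; its unconstrained optimum is x = 3."""
    return (x - 3.0) ** 2


def violation(x):
    """Amount by which the (hypothetical) constraint x <= 1 is violated."""
    return max(0.0, x - 1.0)


def penalized(x, mu):
    """Objective plus a quadratic penalty with weight mu."""
    return objective(x) + mu * violation(x) ** 2


x_unconstrained = ternary_search_min(objective, -10.0, 10.0)
x_penalized = ternary_search_min(lambda x: penalized(x, 1000.0), -10.0, 10.0)
print(x_unconstrained, x_penalized)
```

With a finite penalty weight the result slightly violates the constraint (here it sits just above 1); increasing the weight pushes it towards the constrained optimum at the price of a harder numerical problem.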


Those classifications do not result in disjoint categories. In fact,
multi-criteria optimization problems in the considered heterogeneous network
and services management scenario may fall into one or several of the
categories listed above.

In summary, before the multi-criteria optimization process begins there is a
crucial preliminary step: the definition of the criteria to be optimized,
i.e., the preferences of the DM about the suitability of the obtained
solution.

Regardless of whether the decision maker is the Cloud/Network/SOA designer or
the service operator, the adaptation algorithm must incorporate the impact of
different criteria on their perception of the goodness of any solution. A key
factor in the analysis for decision making is indeed the fact that the
functions that model the decision maker’s preferences (criteria or objective
functions) are not usually known a priori.


2.3 Complexity of Defining Multi-criteria Utility Functions 
to be Incorporated in Network/Management Mechanisms 


Considering the relevance of the choice of a multi-criteria utility function,
this section reviews different tools that help with this task.

Table 1. Characterization of the optimization criteria for DM.

Description type        | Qualitative
                        | Quantitative
Preference articulation | A priori
                        | A posteriori
                        | Progressive
                        | None
Type of variables       | Continuous
                        | Discrete
Constraint definition   | Constrained
                        | Not constrained


— Goal attainment 

— MAUT (Multi-Attribute Utility Theory) 
— Preference relations 

— Fuzzy logic 

— Valuation scale 


Goal Attainment. This basic format restricts the feasible space to the most
relevant set of alternatives according to the DM’s preferences (Fig. 1). Such
preferences, if represented mathematically, usually result in an n-dimensional
shape or contour in the decision space that delimits the solutions acceptable
to the DM (similar to that imposed by the constraints in the design space). It
is a simple and direct format that just splits alternatives into relevant and
non-relevant groups. However, it offers little information about preferences,
as it gives no hint about the predilections of the DM.


Fig. 1. Selection set within feasible space. 


The works in [8,9] describe the use of goal attainment preference modeling in
multi-criteria algorithms.


Multi-Attribute Utility Function. In this case, the utility function is built
by describing the repercussion of an action regarding a specific criterion.
Each action is assigned a numerical value, so that the higher the value, the
more preferable the action. Then, the assessment of an action becomes the
weighted sum of the numerical values related to each considered criterion.
This representation format is capable of modeling the DM’s preferences more
precisely than the Selection Set. Nonetheless, it also means that the DM must
evaluate its inclinations globally, comparing each criterion against all the
others, which is not always possible. Therefore, this type of utility function
is suitable only for the cases in which perfect global rationality can be
assumed [46]. For example, this classic model is commonly employed in the
fields of economics and welfare.
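A weighted-sum utility of this kind can be sketched in a few lines. The criteria names, the weights and the per-criterion scores (normalized so that higher is better) are hypothetical values picked for illustration only.

```python
def weighted_utility(scores, weights):
    """Weighted sum of per-criterion scores, normalized by the total weight."""
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total


# Hypothetical weights expressing the DM's global preferences.
weights = {"energy": 0.5, "cost": 0.3, "qoe": 0.2}

# Hypothetical normalized scores in [0, 1]; higher means more preferable.
actions = {
    "A": {"energy": 0.9, "cost": 0.4, "qoe": 0.5},
    "B": {"energy": 0.6, "cost": 0.8, "qoe": 0.9},
}

utilities = {name: weighted_utility(s, weights) for name, s in actions.items()}
best = max(utilities, key=utilities.get)
print(utilities, best)
```

The sketch also shows the limitation noted above: the DM has to commit to one global set of weights, which fixes the trade-off between every pair of criteria at once.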

Besides, utility has an ordinal nature, in the sense that the preference relation 
between the possible choices is more significant than the specific numerical values 
[4]. So, this leaves the door open to discarding the numerical value of the utility, 
as it is shown next. 


Preference Relations. This representation format models the inclination over
a set of possible choices using a binary relation that describes the
qualitative preference among alternatives. Then, a numerical value is linked
to that relation, defining the preference degree of alternative x_i against
alternative x_j in a quantitative way [37].

This format of preference modeling provides an alternative to assigning a
numerical value to different utility levels. It allows alternatives to be
compared pairwise, giving the DM greater expressiveness to state his
preferences (similar to the Analytic Hierarchy Process, AHP [39]). Outranking
methods employ this format of preference representation.


Fuzzy Logic. This format allows the introduction of uncertainty over the
preferences under analysis. In order to avoid ambiguity in the definition
process of the preferences, each statement “x_i is not worse than x_j” is
assigned a credibility index. In this sense, fuzzy logic becomes a useful tool
[46], as a general framework for preference modeling in which statements held
with certainty are a particular case.

The obstacle of using fuzzy logic with credibility indexes is the weakening 
of the concept of truth. The infinite possible values of truth between absolute 
truth and falseness have an intuitive meaning that does not correspond to their 
formal semantics. In addition, there are other problems, such as the formulation 
of the credibility index itself. 


Valuation Scale. This preference formulation defines a formal representation 
of the comparison between possible choices that expresses both the structure of 
the described situation and the variety of manipulations that can be made on it 
[37]. Sentences of this type are appropriately expressed in logical language. But
classical logic can be too inflexible to acceptably define expressive models. In 
consequence, other formalisms must be taken into account to provide the model 
with the required flexibility. 




Conclusion. The preference or criteria description format plays a crucial
role in defining the nature and structure of the information the DM employs
to express his predilections among the different possibilities. The selection
of the best representation format will depend on the characteristics of the
specific area of expertise. Sometimes, inclinations will be better expressed
using numerical values, and in other cases using more natural descriptions,
such as words or linguistic terms.

The final goal is to contrast the impact of the potential actions with the 
purpose of making a decision. Therefore, it is necessary to establish a scale for 
every considered criterion. The elements of the scale are denoted degrees, levels 
or ranks. 


Table 2. Summary of methods to define multi-criteria utility functions ordered by
complexity.

Goal attainment                       | Simple, just relevant/non-relevant categorization
                                      | of preferences
Multi-Attribute Utility Theory (MAUT) | Utility function as a weighted sum of numeric values
                                      | assigned to criteria, needs perfect global rationality
Preference relations                  | Modeled through binary relations to define preferences
                                      | pairwise
Fuzzy logic                           | Introduces uncertainty through a credibility index
Valuation scale                       | Establishes a formal representation of the preference
                                      | between alternatives


Table 2 summarizes the aforementioned methods to define multi-criteria
utility functions that represent the preferences of the decision maker
regarding the multiple criteria to be optimized. The table also orders them
according to their complexity, starting with the simpler Goal Attainment
method and ending with the more complete Valuation Scale.


2.4 Multi-criteria Problems Solving Mechanisms 


Once the optimization problem is modeled or formulated, the solution is found 
after the application of an optimization method. 

Most optimization algorithms rely on an iterative search process. Beginning
with an initial approximation to the solution, the algorithm performs
consecutive steps towards the termination point. The search strategy is what
distinguishes the diverse methods, and there is no universal method applicable
to every kind of problem. Table 3 shows a classification of the main
optimization solving families.

Table 3. Classification of optimization solving methods.

Weighted sum [21]                | The multiple objective functions are aggregated in a
                                 | single function by the assignment of weights
Random search [13]               | Generate random numbers to explore the search
                                 | (feasible) space
Tabu search [14]                 | Iteratively make movements around the current
                                 | solution constrained by a group of forbidden or tabu
                                 | movements
Physical programming [31]        | Incorporate preferences without the need of weight
                                 | assignment. Address both design metrics and
                                 | constraints in the same way, integrating them into the
                                 | utility function
Lexicographic [6]                | Objective functions are processed on a hierarchical basis
Genetic and evolutionary [12,15] | Imitate the optimization process of natural
                                 | selection. Employ techniques such as heredity,
                                 | mutation, natural selection or factor recombination to
                                 | explore the feasible space and select the current
                                 | solution
Simulated annealing [14]         | Imitate the iterative process of cold and heat
                                 | application for metal annealing by increasing or
                                 | decreasing the difference between the ideal solution
                                 | and the current approach
Ant colony optimization (ACO)    | Imitate animal behavior related to their intra-group
and swarm optimization [40]      | communication or their search for the optimal ways
                                 | towards the food
Outranking methods [46]          | Build an ordered relation of the feasible alternatives
                                 | based on the defined preferences over a set of criteria
                                 | to eventually complete a recommendation


2.5 Fairness Consideration 


Traditionally, the goal of any optimization problem has been the search for the
optimum solution for a given situation among all the possible ones in the
feasible solution space. This optimality has often been understood as a Pareto
Optimum, i.e., the result of the maximization/minimization of the objective
functions (or criteria), where the result of none of the objective functions
can be improved except at the expense of worsening another one. Finding a
Pareto-optimal solution means finding the technically most efficient solution.
Applying this concept to the field of networking, this optimality results in
the optimum distribution of resources among the flows traveling through the
network.

Obviously, an optimal distribution of resources does not always imply an
equitable use of them. Indeed, in some cases it may lead to absolutely unfair
situations that entail the exhaustion of some resources. In that sense, the
efficient assignment of resources derived from the direct application of
optimization algorithms may leave some customers or final users without
service, because all the benefit is provided to others (see examples in
[9,38]). In such cases the global utility of the system is maximized, but the
result is clearly unfair, and the situation worsens as the heterogeneity of
the final users increases.

The conflict between the maximization of the benefit, the optimal resource
allocation and the fairness of the distribution has been widely analyzed in
Economics, as part of microeconomics and public finance. The conclusion is
that the incompatibility between fairness and efficiency is not a design
problem of the optimization algorithms, but of the formulation of the problem
to be optimized, in which the fairness concept must be included. The
difficulty arises because efficiency is an objective or technical goal that,
in consequence, can be measured and assessed quantitatively. Fairness, by
contrast, is a subjective concept whose assessment is not trivial.

Although fairness may initially seem easy to define, it has a variety of
aspects that complicate its proper delimitation. Taking fairness in the sense
of equality, an equitable distribution of resources could be defined as one
that splits the available resources evenly among the flows competing for
them. The disadvantage of this distribution is that it does not take into
account the specific necessities of each flow. If all the flows obtain the
same portion of resources, those with lower requirements benefit from a
proportionally higher resource quantity.

Changing the definition of fair distribution to one that assigns the
resources proportionally on the basis of flow requirements is not the ideal
solution either. In this case, the most demanding flows benefit, i.e., those
which contribute most to network congestion, to the detriment of lighter
transmissions and, consequently, of the global performance of the network.

In addition, other aspects such as cooperation must also be considered. There 
may be some nodes in the network not willing to give up their resources to 
other transmissions, and so, this kind of behavior should be punished. But, what 
happens when a node doesn’t give up resources to the network due to the lack 
of them? It would be the case of a node with low battery or low capacity links. 
Would these be reason enough to reduce the transmission resources that have 
been assigned? In this case, would the distribution be fair? This conflict remains 
unsolved, although some approaches have been formulated and are discussed 
next. 

The work in [8] presents several interpretations of the concept of fairness.
On the one hand, there is the widely accepted max-min fairness definition
[38], usually employed in social science. It is based on the search for
consecutive approximations to the optimum solution in such a way that no
individual or criterion can improve its state or utility if it means a loss
for a weaker individual or criterion.

Translating this concept to communications, the distribution of network 
resources is considered max-min fair when all the minimum transmission rates 
of the data flows are maximized and all the maximum transmission rates are 
minimized. It is proven that this fairness interpretation is Pareto-efficient. 
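For a single shared link, a max-min fair allocation can be computed by progressive filling, as in the following minimal sketch. The single-link setting, the function name and the sample demands are assumptions made for illustration; a real network requires the multi-link version of the procedure.

```python
def max_min_share(capacity, demands):
    """Max-min fair split of one link's capacity among flows with given demands.

    Flows are served in increasing order of demand: a flow whose demand fits
    within the current equal share gets its full demand; otherwise all the
    remaining flows receive the same equal share of what is left.
    """
    alloc = [0.0] * len(demands)
    remaining = float(capacity)
    pending = sorted(range(len(demands)), key=lambda i: demands[i])
    while pending:
        fair_share = remaining / len(pending)
        i = pending[0]
        if demands[i] <= fair_share:
            alloc[i] = float(demands[i])
            remaining -= demands[i]
            pending.pop(0)
        else:
            for j in pending:
                alloc[j] = fair_share
            break
    return alloc


# Hypothetical link of capacity 12 shared by three flows.
allocation = max_min_share(12, [2, 4, 10])
print(allocation)
```

In the example the two small flows are fully satisfied and the largest flow receives everything that is left, so no flow could be given more without taking capacity from a flow that already has less.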

Another interpretation of fairness that also searches for the trade-off
between efficiency and equity is proportional fairness [26]. A resource
distribution among the network flows is considered proportional when the
planned priority of a flow is inversely proportional to the estimated resource
consumption of this flow. It can also be proven that proportional fairness is
Pareto-efficient.
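For comparison, proportional fairness is commonly stated in the literature in the form below. It is given here as a reference formulation under stated assumptions, not necessarily the exact definition used in [26]: x denotes a feasible rate vector, x^* the proportionally fair one, n the number of flows, and \mathcal{X} the feasible set, assumed convex. All of these symbols are notation introduced for this sketch.

```latex
\sum_{i=1}^{n} \frac{x_i - x_i^{*}}{x_i^{*}} \le 0
\quad \text{for every feasible } x \in \mathcal{X},
\qquad \text{equivalently} \qquad
x^{*} = \arg\max_{x \in \mathcal{X}} \sum_{i=1}^{n} \log x_i .
```

Read this way, any move away from x^* yields relative rate gains for some flows that are outweighed by the relative losses of the others.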

In both aforementioned interpretations, the bandwidth is shared so as to
maximize some utility function for instantaneous flows. This means that the
optimality of the resource assignment is measured for a static combination of
flows. Taking into account the random nature of real network traffic, it is
necessary to define the utility in terms of the performance of individual
flows with finite duration. In this case, it is not so clear that the max-min
or proportional fairness concepts reach an optimum result. With random
traffic, the performance and, in consequence, the utility depend on precise
statistics of the offered traffic and are hard, if not impossible, to assess
analytically.

When flows are shared under a balanced fairness criterion [9], the
performance becomes insensitive to the specific traffic characteristics,
which simplifies its formulation. The term balanced fairness comes from the
necessary and sufficient relations that must be fulfilled to guarantee
insensitivity in stochastic networks. This insensitivity entails that the
distribution of the number of active flows and, in consequence, the estimated
throughput, depend only on the mean traffic offered on each route.

Balanced fairness makes it possible to approximate the behavior of elastic
traffic over the network and, in addition to the insensitivity property, it
also makes it possible to find the exact probability distribution of the
number of concurrent flows on different routes and then evaluate the
performance metrics.

Balanced fairness is not always Pareto-efficient, but when a Pareto-efficient
balanced allocation exists, it is unique.


3 Cost/Energy/*-Aware Network and Cloud Services 
Management Scenarios 


Now that the best-known multi-criteria optimization techniques have been
introduced, the next step is to analyze the application scenarios. This
section overviews several research scenarios where energy-aware control of
different systems has been considered as part of the ACROSS project. The
scenarios include the following: modeling and analysis of the
performance-energy trade-off in data centers, characterization and
energy-efficiency of applications in cloud computing, energy-aware load
balancing in 5G HetNets and, finally, incorporating energy and cost into
opportunistic QoE-aware scheduling.


3.1 Modeling and Analysis of Performance-Energy Trade-Off 
in Data Centers 


An increasing demand for green ICT has inspired the queueing community to
consider energy-aware queueing systems. In many cases, it is no longer enough
to optimize just the performance costs, but one should also take into account the
energy costs. An idle server (waiting for an arriving job to be processed) in the
server farm of a typical data center may consume as much as 60% of the peak
power. From the energy point of view, such an idle server should be switched
off until a new job arrives. However, from the performance point of view, this
is suboptimal since it typically takes a rather long time to wake the server up.
Thus, there is a clear trade-off between the performance and energy aspects.

The two main metrics used in the literature to analyze the performance- 
energy trade-off in energy-aware queueing systems are ERWS and ERP. Both of 
them are based on the expected response time, E[T], and the expected power 
consumption per time unit, E[P]. The former one, ERWS, is defined as their 
weighted sum, w_1 E[T] + w_2 E[P], and the latter one, ERP, as their product,
E[T] · E[P]. Also, generalized versions of these can be easily derived.
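The two metrics can be illustrated with a minimal sketch that compares an always-on server with one that switches off as soon as it becomes idle. The sketch assumes, beyond what the text states, an M/M/1 queue with arrival rate lam and service rate mu, exponentially distributed setup times with rate alpha, zero power while off, and peak power during setup. Under these assumptions the mean response time with setup is the M/M/1 value plus the mean setup time. All numerical values are hypothetical.

```python
def erws(mean_response, mean_power, w1, w2):
    """Weighted sum of expected response time and expected power."""
    return w1 * mean_response + w2 * mean_power


def erp(mean_response, mean_power):
    """Product of expected response time and expected power."""
    return mean_response * mean_power


def always_on(lam, mu, p_busy, p_idle):
    """(E[T], E[P]) for an M/M/1 server that idles but never sleeps."""
    rho = lam / mu  # utilization, requires lam < mu
    return 1.0 / (mu - lam), rho * p_busy + (1.0 - rho) * p_idle


def instant_off(lam, mu, alpha, p_busy, p_setup):
    """(E[T], E[P]) for an M/M/1 server that switches off when idle and
    needs an exponential setup time (rate alpha) to wake up."""
    rho = lam / mu
    mean_t = 1.0 / (mu - lam) + 1.0 / alpha
    # Non-busy time splits between 'off' and 'setup' in proportion 1/lam : 1/alpha.
    frac_setup = (1.0 - rho) * lam / (lam + alpha)
    return mean_t, rho * p_busy + frac_setup * p_setup


LAM, MU = 0.5, 1.0                 # jobs per second (hypothetical)
P_BUSY, P_IDLE = 200.0, 120.0      # watts; idle at 60% of peak

on = always_on(LAM, MU, P_BUSY, P_IDLE)
slow = instant_off(LAM, MU, 0.1, P_BUSY, P_BUSY)   # mean setup 10 s
fast = instant_off(LAM, MU, 10.0, P_BUSY, P_BUSY)  # mean setup 0.1 s

for name, (t, p) in (("always-on", on), ("off, slow setup", slow), ("off, fast setup", fast)):
    print(f"{name:16s} E[T]={t:6.2f}  E[P]={p:7.2f}  ERP={erp(t, p):8.1f}  "
          f"ERWS={erws(t, p, 1.0, 0.01):6.2f}")
```

With these numbers, switching off is worse than staying on under ERP when the setup is slow and better when it is fast, which is the trade-off the metrics are meant to capture.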

Here we model data centers as queuing systems and develop policies for the 
optimal control of the performance-energy trade-off. For a single machine the 
system is modeled as an M/G/1 queue. When considering a whole data-center, 
then a natural abstraction of the problem is provided by the dispatching problem 
in a system of parallel queues. 


Optimal Sleep State Control in M/G/1 Queue: Modern processors support many
sleep states to enable energy saving, and the deeper the sleep state, the
longer the setup delay to wake up from it. An additional feature in the
control is to consider whether it helps to wait for a random time (idling
time) after a busy period before going to sleep. Possible approaches for the
sleep state selection policy include a randomized policy, where the processor
selects the sleep state from a given (optimized) distribution, and a
sequential policy, where sleep states are traversed sequentially starting
from the lightest sleep state to the deepest one. The analysis of such a
queueing system resembles that of classical vacation models.

Gandhi et al. [17] considered the M/M/1 FIFO queue with deterministic setup
delay and a randomized sleep state selection policy but without the
possibility of an idle timer, i.e., the timer is either zero or infinite.
They showed for the ERP metric that the optimal sleep state selection policy
is deterministic, i.e., after a busy period the system goes to some sleep
state with probability 1 (which state depends on the parameters). Maccio and
Down [29] added the possibility of an exponential idle timer in the server
before going to sleep, and showed for the ERWS cost metric and for
exponential setup delays that the optimal idle timer control still sets the
idle timer equal to zero or infinity, i.e., the idle timer control remains the
same. Gebrehiwot et al. considered the more general M/G/1 model with generally
distributed service times, idle timer distributions and setup delays, both ERP
and ERWS cost metrics (and even slightly more generalized ones) and
randomized/sequential sleep state selection policies. Assuming the FIFO
service discipline, it was shown in [20] that even after all the
generalizations the optimal control remains the same: the optimal policy
either (a) never uses any sleep state or (b) goes directly to some
deterministic sleep state and wakes up from there. This result was also shown
to hold for the Processor Sharing (PS) discipline in [19] and for the Shortest
Remaining Processing Time (SRPT) discipline in [18]. Thus, it is plausible
that the result holds for any work-conserving discipline.


Energy-Aware Dispatching with Parallel Queues: The data center can be modeled
as a system of parallel single-server queues with setup delays. The system
receives randomly arriving jobs with random service requirements. The problem
is then to decide, for each arriving job, which queue to dispatch it to based
on the state information available about the system, e.g., the number of jobs
in the queues. Another modeling approach is to consider a centralized queue
with multiple servers, i.e., the models are then variants of the multiserver
M/M/n model.

In the parallel queue setting and without any energy-aware considerations,
the optimality of the Join-the-Shortest-Queue (JSQ) policy for minimizing the
mean delay with homogeneous servers is a classical result, see [48]. However,
in an energy-aware setting the task is to strike a balance between using
enough servers to provide reasonably low job delay, taking into account the
additional setup delay costs, and letting the other servers sleep to save
energy. How to achieve this is not at all clear. For the centralized queue
approach, Gandhi et al. proposed the delayed-off scheme, where servers upon a
job completion start an idle timer and wait in the idle state for this time
before going to sleep, and new jobs are sent to idle servers if one is
available; otherwise some sleeping server is activated. An exact analysis
under Markovian assumptions was done in [16], and it was shown that by
appropriately selecting the mean idle timer value, the system keeps a
sufficient number of servers in the busy/idle state and allows the rest to
sleep. An important result has only recently been obtained by Mukherjee et al.
in [33], which considers the delayed-off scheme in a distributed parallel
queue setting: it was shown that asymptotically delayed-off can achieve the
same delay scaling as JSQ, i.e., it is asymptotically delay optimal, and at
the same time leaves a certain fraction of servers in a sleep state,
independent of the value of the idle timer and the setup delay. This result
holds asymptotically when the server farm is large, with thousands of servers.
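The dispatching decisions discussed above can be illustrated with a minimal sketch. The `jsq` function implements plain join-the-shortest-queue. The `energy_aware_dispatch` function follows the rule described in the text (use an idle server if one exists, otherwise wake a sleeping one) and, as an assumption of this sketch rather than a published scheme, falls back to JSQ when every server is busy. Idle timers and setup delays are not modeled, and all names are hypothetical.

```python
from enum import Enum


class State(Enum):
    BUSY = "busy"
    IDLE = "idle"
    SLEEP = "sleep"


def jsq(queue_lengths):
    """Index of the shortest queue (lowest index wins ties)."""
    return min(range(len(queue_lengths)), key=lambda i: queue_lengths[i])


def energy_aware_dispatch(queue_lengths, states):
    """Pick a server for a new job: an idle server if available, otherwise
    wake a sleeping one, otherwise (all busy) join the shortest queue."""
    idle = [i for i, s in enumerate(states) if s is State.IDLE]
    if idle:
        return idle[0]
    asleep = [i for i, s in enumerate(states) if s is State.SLEEP]
    if asleep:
        return asleep[0]  # the job will additionally incur the setup delay
    return jsq(queue_lengths)


print(jsq([3, 1, 2]))
print(energy_aware_dispatch([2, 0, 0], [State.BUSY, State.SLEEP, State.IDLE]))
print(energy_aware_dispatch([2, 3, 1], [State.BUSY, State.BUSY, State.BUSY]))
```

Waking a sleeping server whenever no idle one exists is only one possible choice; a rule that tolerates short queues at busy servers before waking another machine would trade delay for energy differently.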

However, in a small or moderately sized data center there is still scope for
optimization. In this setting the use of MDP (Markov Decision Process) and
Policy Iteration has been recently considered by Gebrehiwot et al. in [28],
where the data center is assumed to consist of two kinds of servers: normal
always-on servers and instant-off servers, which go to sleep immediately after
their queue empties, i.e., there are no idle timers. An explicit near-optimal
policy is obtained for minimizing the ERWS metric that uses as state the
number of jobs in the queues and the busy/sleep status. Also, size-aware
approaches with MDP have been recently applied by Hyytia et al. in [24,25].


3.2 Characterization and Energy-Efficiency of Applications 
in Cloud Computing 


Modeling Applications. With the goal of improving energy efficiency in cloud
computing, several authors have studied the different factors that cause
energy loss and energy waste in data centers. In [32], the different aspects
are discussed in detail, and idle runs are discussed as one of the causes of
energy waste, as already mentioned earlier in this chapter. Low power modes
have been proposed in the literature both for servers and storage components;
however, their benefits are often limited due to their transition costs and
inefficiencies. To improve energy efficiency and reduce the environmental
impact of federated clouds, in the EU project ECO2Clouds [47] an adaptive
approach to resource allocation is proposed, based on monitoring the use and
energy consumption of resources, and associating it to running applications.
The demand for resources can therefore be associated to applications
requesting resources, rather than only to the scheduling of resources and
tasks in the underlying cloud environment.

Along this line, we have studied within ACROSS how different types of appli- 
cations make use of resources, with the goal of improving energy efficiency. 

As mentioned in Sect. 3.1, to compare different solutions in terms of response 
time and power consumption, the two main approaches are ERWS and ERP. An 
alternative, which allows evaluating energy efficiency at application level, is the 
energy per job indicator. This indicator allows comparing different solutions in 
terms of work performed, rather than on performance parameters, and to discuss 
ways of improving energy efficiency of applications in terms of application-level 
parameters. 
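The energy per job indicator can be illustrated with a minimal sketch. It assumes, beyond what the text states, a single VM modeled as an M/M/1 queue with arrival rate lam and service rate mu, and a power draw that grows linearly with utilization from p_idle to p_busy. The energy per job is then the mean power divided by the job throughput. All parameter values are hypothetical.

```python
def energy_per_job(lam, mu, p_busy, p_idle):
    """Mean energy (J) attributed to one job: mean power (W) / throughput (jobs/s)."""
    rho = lam / mu  # utilization, requires lam < mu
    mean_power = p_idle + (p_busy - p_idle) * rho
    return mean_power / lam


def mean_response_time(lam, mu):
    """Mean response time of an M/M/1 queue."""
    return 1.0 / (mu - lam)


MU, P_BUSY, P_IDLE = 10.0, 200.0, 120.0  # hypothetical service rate and powers

rows = [
    (lam, energy_per_job(lam, MU, P_BUSY, P_IDLE), mean_response_time(lam, MU))
    for lam in (1.0, 5.0, 9.0)
]
for lam, epj, t in rows:
    print(f"lambda={lam:4.1f}  energy/job={epj:7.2f} J  E[T]={t:5.2f} s")
```

At low load most of the energy charged to a job is idle energy, so the indicator is high. As the load approaches the service rate the indicator drops while the response time grows, which is why the indicator is read together with a performance metric.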

Another aspect which has been considered is that increasing resources is not
always beneficial in terms of performance, as systems may present bottlenecks
in their execution which cause inefficiencies: in some cases, the additional
resources will worsen energy efficiency, as the new resources do not solve the
problem and are themselves underutilized. As a consequence, in considering the
energy efficiency of applications in clouds, some aspects can better
characterize the use of resources:


— Shared access to resources: during their execution, applications can request
access to shared resources, with an impact on energy consumption due to
synchronization and waiting times.

— The characterization of the application execution patterns: batch applica- 
tions and transactional applications present different execution patterns: in 
batch applications the execution times are usually longer with larger use 
of resources, but response time constraints are not critical; in transactional 
applications, response times are often subject to constraints and the allocated 
resources must guarantee they are satisfied. 


These application-level aspects have an impact on the resource allocation
criteria in different cases. In the following, we discuss how to model batch
and transactional applications considering these aspects, in order to choose
the number of resources (in terms of VMs) to be associated with an application
so as to minimize the energy-per-job parameter.


Batch Applications. Batch applications have been studied in detail in [22],
considering the following aspects: number of VMs allocated for executing a
batch of similar applications, shared resources (in particular shared storage
access and access synchronization), and heterogeneous deployment environments
for VMs, with servers of different capacity.

While for the details we refer to [22], we summarize here the main
characteristics of the approach. The general goal is to minimize idle time to
improve energy efficiency, while avoiding an increase in the execution time of
each application in the batch, which would result in an increase of the total
energy. We assume that in computing the energy per job, idle time is
distributed among all applications being run on the system on an equal basis.
Queuing models have been developed to represent applications, in terms of
computing nodes to execute the application and storage nodes for data access,
which is assumed to be shared, with the possibility of choosing between
asynchronous access and synchronous access (with synchronization points). In
both cases the critical point is represented by the ratio between the service
time for computing nodes and the service time for storage access: going beyond
this point, the energy per job increases without significant benefit in
execution times.

An example is shown in Fig. 2, where it is clear that increasing the number
of VMs for an application beyond the critical point mainly results in a loss
of energy efficiency, both with synchronous and asynchronous storage access.


Transactional Workloads. For transactional workloads, the main application- 
level parameter affecting energy consumption is the arrival rate. In fact, assuming 
an exponential distribution of arrivals, if the arrival rate \ is much lower than the 
service time, the idle times will be significant. On the other hand, getting closer to 
service time, the response time will increase, as shown in Fig. 3. The details of the 
computations can be found in [23]. The paper also describes how different load 
distribution policies for VMs can influence energy-per-job. Assuming again that 
idle power is uniformly distributed to all VMs running on the same host, three 
policies have been evaluated: (1) distributing the load equally; (2) allocating 
larger loads to VMs with lower idle power; (3) allocating larger loads to VMs 
with higher idles power. Initial simulation results result in Policy 2 being the 
worst, while Policy 1 and 3 are almost equivalent, with Policy 1 resulting in 
better energy-per-job and Policy 3in better response times [23]. 

Fig. 2. Energy per job in batch applications

Fig. 3. Energy-per-job in transactional applications [23]
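
The trade-off described for transactional workloads can be illustrated with a
minimal sketch. It assumes an M/M/1 queue and a linear power model; the function
name, the power values and the model itself are illustrative assumptions and are
not taken from [23].

```python
def mm1_metrics(lam, mu, p_idle=100.0, p_busy=200.0):
    """Return (mean response time [s], energy per job [J]) for one server.

    Assumptions (illustrative only): Poisson arrivals with rate lam [jobs/s],
    exponential service with rate mu [jobs/s], and a server that draws
    p_idle watts when idle and p_busy watts when busy.
    """
    if not 0 < lam < mu:
        raise ValueError("need 0 < lam < mu for a stable queue")
    rho = lam / mu                        # utilization
    response_time = 1.0 / (mu - lam)      # M/M/1 mean response time
    mean_power = p_idle * (1.0 - rho) + p_busy * rho
    energy_per_job = mean_power / lam     # W / (jobs/s) = J per job
    return response_time, energy_per_job


if __name__ == "__main__":
    mu = 10.0
    for lam in (1.0, 5.0, 9.0, 9.9):
        t, e = mm1_metrics(lam, mu)
        print(f"lam={lam:4.1f}  response={t:6.3f} s  energy/job={e:7.2f} J")
```

Under these assumptions, a low arrival rate makes each job carry a large share
of idle energy, while an arrival rate close to the service rate lowers the
energy per job at the price of a rapidly growing response time.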


3.3 Energy-Aware Load Balancing in 5G HetNets 


The exponential growth of mobile data continues, and heterogeneous networks
(HetNets) have been introduced as a vital part of the network architecture of
future 5G networks. HetNets in particular alleviate the problem of large spatial
variations in user data intensity. They are
network architectures with small cells (e.g., pico- and femtocells) overlaying the
macrocell network. The macrocells are high power base stations providing the 
basic coverage to the whole cell area, while the small cells are low power base 
stations used for data traffic hotspot areas within a macrocell to improve spectral 
efficiency per unit area or for areas that the macrocell cannot cover efficiently. 

In HetNets, when a user arrives in the coverage area of a small cell it can
typically connect to either the local small cell or to the macrocell, as illustrated
in Fig. 4. Typically, a small cell offers the possibility of achieving high
transmission rates within its coverage area. However, depending on the congestion level
at the small cell it may be better from the system point of view to utilize the 
resources of the macrocell instead. This raises the need to design dynamic load 
balancing algorithms. In 5G networks the energy consumption of the system 
will also be an important factor. Thus, the load balancing algorithms must be 
designed so that they take into account both the performance of the system, as 
well as the energy used by the whole system. 

Consider a single macrocell with several small cells inside its coverage area. 
The small cells are assumed to have a wired backhaul connection to the Internet. 

Fig. 4. User inside a femtocell may connect either to the local small cell (femto) or the
macrocell to achieve better load balancing.


They typically also operate on a different frequency than the macrocell and hence 
do not interfere with the transmissions of the macrocell. From the traffic point
of view, each cell, whether it is the macrocell or a small cell, can be considered
as a server with its own queue and its own characteristics. The traffic itself
may consist, for example, of elastic data flows. The load balancing problem then
corresponds to a problem of assigning arriving jobs or users to parallel queues.
The difference from a classical dispatching problem, where an arrival can be routed
to any queue, is that in this case the arrival can only select between two queues: 
its own local queue or the queue representing the macrocell. 

In order to include the energy aspects in the model, the macrocell must be 
assumed to be operating at full power continuously. This is because the macrocell 
provides the control infrastructure and the basic coverage in the whole macrocell 
region and it cannot be switched off. However, depending on the traffic situation
it may be reasonable to switch off a low power small cell since the small cells 
typically have power consumption at least an order of magnitude lower than 
the macrocell. The cost of switching off a base station is that there may be a 
significant delay, the so-called set up delay, when turning the base station back 
on again. The queueing models used for the small cells must then be generalized 
to take this into account. 

The resulting load balancing problem, which optimizes, for example, a weighted
sum of the performance and energy costs of the whole system, is difficult.
However, it can be approached under certain assumptions by using the theory of
Markov Decision Processes. This has been done recently by Taboada et al. in [42].
Their results indicate that a dynamic policy that knows the sleep state of the
small cells and the number of flows is better able to keep the small cells
sleeping than an optimized randomized routing policy. It thus avoids the harmful
effect of setup delays, leading to gains in both performance and energy. At high
loads the energy gain vanishes, but the dynamic policy still gives a good
improvement in performance.
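
To make the structure of the dispatching decision concrete, the following
minimal sketch routes an arriving flow either to its local small cell or to the
macrocell by comparing a weighted delay-plus-energy cost. It is a simple
state-dependent heuristic written for illustration, not the MDP-derived policy
of [42]; the class, the delay estimate and the weight `energy_weight` are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    flows: int                 # number of flows currently in the cell
    rate: float                # service rate when switched on [flows/s]
    asleep: bool = False       # only meaningful for a small cell
    setup_delay: float = 0.0   # wake-up (set up) delay [s]
    power_on: float = 0.0      # extra power drawn once switched on [W]


def est_delay(cell: Cell) -> float:
    """Rough delay estimate for a new flow (assumed, not from the source)."""
    delay = (cell.flows + 1) / cell.rate
    if cell.asleep:
        delay += cell.setup_delay   # the flow must wait for the wake-up
    return delay


def dispatch(small: Cell, macro: Cell, energy_weight: float = 0.01) -> str:
    """Return 'small' or 'macro' for an arriving flow.

    The macrocell is always on, so routing to it adds no power; waking a
    sleeping small cell adds its power_on, weighted by energy_weight.
    """
    cost_macro = est_delay(macro)
    extra_power = small.power_on if small.asleep else 0.0
    cost_small = est_delay(small) + energy_weight * extra_power
    return "small" if cost_small < cost_macro else "macro"
```

With these assumptions, a sleeping small cell is left asleep as long as the
macrocell is lightly loaded, and is woken up only when macrocell congestion
outweighs the setup delay and the extra power.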




3.4 Incorporating Energy and Cost to Opportunistic QoE-Aware 
Scheduling 


One of the fundamental challenges that network providers face nowadays is
managing the sharing of network resources among users’ traffic flows. Most
traditional scheduling strategies for resource allocation have been oriented
towards the maximization of objective quality parameters. Nevertheless, given
the importance of allocating network resources so as to maximize subjective
quality, scheduling algorithms aimed at maximizing users’ perception of quality
have become essential.

Thus, to address the gaps found in the field of traffic flow scheduling
optimization, in recent years we have analyzed the following three stochastic
and dynamic resource allocation problems:


1. Subjective quality maximization when channel capacity is constant [44], 

2. Subjective quality maximization in channels with time-varying capacity [43], 

3. Mean delay minimization for general size distributions in channels with time- 
varying capacity [41,45]. 


Since finding an optimal solution to these problems is analytically and
computationally infeasible, we focus on designing simple, tractable,
implementable and well-performing heuristic priority scheduling rules.

To this end, our research relies on the Markov Decision Process (MDP)
framework and on the Gittins and Whittle methods [41,43-45] to obtain
scheduling index rules. First, the above scheduling problems are modeled in
the framework of MDPs. Then, using methodologies based on the Gittins and/or
Whittle approaches, we have proposed scheduling index rules with closed-form
expressions.

The idea of Gittins consists in allocating resources to the jobs that currently
make the most productive use of the resource. The Gittins index is the value of
the charge for which the expected serving cost to the scheduler balances the
expected reward obtained when serving a job in r consecutive time slots. This
results in the ratio between the expected total reward earned and the expected
time spent in the system when serving a job in r consecutive time slots.
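
The ratio above can be sketched numerically. The following minimal example
assumes a discrete job-size distribution, a unit reward earned at job
completion, and an index taken as the maximum of the ratio over the horizon r;
the function name and these modeling choices are illustrative assumptions and
do not reproduce the closed-form rules of [41,43-45].

```python
def gittins_index(pmf, attained=0):
    """Gittins index of a job that has already received `attained` slots.

    pmf maps the job size in slots (positive int) to its probability.
    For each horizon r the ratio is
        P(job completes within r more slots) / E[slots actually used],
    both restricted to sizes larger than `attained` (the conditioning
    constant cancels in the ratio). The index is the maximum over r.
    """
    sizes = sorted(k for k, p in pmf.items() if k > attained and p > 0)
    if not sizes:
        raise ValueError("job cannot still be in service")
    best = 0.0
    for r in range(1, sizes[-1] - attained + 1):
        limit = attained + r
        done = sum(pmf[k] for k in sizes if k <= limit)
        used = sum((k - attained) * pmf[k] for k in sizes if k <= limit)
        used += r * sum(pmf[k] for k in sizes if k > limit)
        best = max(best, done / used)
    return best
```

For instance, under these assumptions a job known to need exactly three slots
has index 1/3 on arrival and index 1 once two slots have been served, so its
priority grows as it approaches completion.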

On the other hand, the Whittle approach consists in obtaining a function that
measures the dynamic service priority. For that purpose, the optimization
problem formulated as an MDP is relaxed by requiring that one job be served per
slot on average, which allows the constraint to be moved into the objective
function. The relaxed problem is then approached by Lagrangian methods and
decomposed into single-job, price-based parametrized optimization problems.
Since the Whittle index is the break-even value of the Lagrangian parameter, it
can be interpreted as the cost of serving. In this way, the Whittle index
represents the ratio between marginal reward and marginal work, where marginal
reward (work) is the difference between the expected total reward earned (work
done) by serving and by not serving at an initial state and then employing a
certain optimal policy.
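
The relaxation and the resulting index can be written compactly as follows.
This is only a sketch under assumed notation (discount factor \beta, action
a_k(t) \in \{0,1\} for job k, per-slot reward R_k, multiplier \nu, and
\mathbb{R}, \mathbb{W} for expected total discounted reward and work); the
symbols are not taken from [41,43-45].

```latex
% Relaxed problem: one job served per slot on average
\max_{\pi}\ \mathbb{E}^{\pi}\Big[\sum_{t\ge 0}\beta^{t}\sum_{k} R_k\big(x_k(t),a_k(t)\big)\Big]
\quad\text{s.t.}\quad
\mathbb{E}^{\pi}\Big[\sum_{t\ge 0}\beta^{t}\sum_{k} a_k(t)\Big]=\frac{1}{1-\beta}

% Lagrangian relaxation with multiplier \nu, which decomposes per job k
\max_{\pi_k}\ \mathbb{R}_k^{\pi_k}(x)-\nu\,\mathbb{W}_k^{\pi_k}(x)

% Whittle index: marginal reward over marginal work at state x, where the
% superscript <a,\pi^*> means "take action a first, then follow \pi^*"
\nu_k^{*}(x)=
\frac{\mathbb{R}_k^{\langle 1,\pi^{*}\rangle}(x)-\mathbb{R}_k^{\langle 0,\pi^{*}\rangle}(x)}
     {\mathbb{W}_k^{\langle 1,\pi^{*}\rangle}(x)-\mathbb{W}_k^{\langle 0,\pi^{*}\rangle}(x)}
```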

As a first step towards the multi-criteria optimization targeted by ACROSS, it
is worth mentioning the utility-based MDP employed in [43,44] for QoE
maximization. The utility function depended on delay only, but we plan to extend
it to a generic problem aimed at maximizing a multivariate objective function.
Considering the meaning of work and reward in Whittle-related modeling, such an
extension could require modifying the structure of the problem itself (i.e., an
alternative MDP) or just considering different criteria in the work/reward
assignments.

Although we have carried out some very preliminary tests with LP- and AHP-based
articulation of preferences for QoE vs. energy optimization in [27], we plan
to further analyze index rule techniques in the multi-criteria problem.


4 Current Technologies and Solutions 


Research on energy-aware control has long been actively pursued in academia,
and Sect. 3 introduced several scenarios that have been analyzed and have given
valuable insights into the fundamental tradeoff between energy efficiency and
QoS/QoE. Due to the rising costs of energy, the industry is also actively
developing solutions that would enable more energy-efficient networks. Next we
review industry efforts towards such architectures, and finally we introduce a
framework for energy-aware network management systems.


4.1 Industry Efforts for Integrating Energy Consumption 
in Network Controlling Mechanisms 


New network technologies have recently started to consider cost/energy
issues in the early stages of the design and deployment process. Besides the
infrastructure upgrade, the incorporation of such technologies requires
network managers to handle a number of real-time parameters in order to
optimize the network energy/cost profile. These parameters include, among
others, the sleep status of network elements, the activation of mobile
resources to provide extra coverage, and changes in the performance status of
some of the processors in the network.

The fact is that energy consumption in networks is rising: network equipment
requires more power and greater amounts of cooling. According to [27], by 2017
more than 5 zettabytes of data will pass through the network every year. The
period 2010-2020 will see an important increase in the ICT equipment needed to
provide and serve this traffic. Smartphones and tablets will drive mobile
traffic to grow up to 89 times by 2020, causing energy use to grow exponentially.
For example, mobile video traffic is expected to grow 870%, M2M (IoT) 990%
and applications 129%. As a consequence, ICT will consume 6% of total global
energy consumption; in 2013 it was 109,1 GW according to the energy use models
at different network levels shown in Table 4.

Table 4. Energy use models.


Devices                   Networks
PCs           36,9 GW     Home & Enterprise                      9,5 GW
Printers       0,9 GW     Access                                21,2 GW
Smartphones    0,6 GW     Metro (aggregation and transport)      0,6 GW
Mobile         0,6 GW     Edge                                   0,7 GW
Tablets        0,2 GW     Core                                   0,3 GW
Service Provider & Data Center                                  37,1 GW


One of the challenges the industry faces is how to support that growth in
a sustainable and economically viable way. However, there is an opportunity
for important reductions in energy consumption, because networks are
dimensioned in excess of current demand and, even when the network carries
little traffic, the power used is considerable and most of it is wasted [34]. The
introduction of new technologies will help to improve energy efficiency in
the different scenarios (see Table 5).


Table 5. Scenarios for energy efficiency increase.


Home: Sleep mode
Office: Cloud
Access: VDSL2, Vectoring, VoIP
Wireless Access: LTE Femto, Small, HetNet
IP: MPLS Backhaul
Fixed Wireless: Microwave Backhaul for Wireless 2G/3G, Fiber
Copper: VDSL2, Vectoring, VoIP, PON
Metro: IP/MPLS Transport, Packet Optical
Edge: IP Edge
IP Core: Next Gen IP Router and Transport (10 Gb)
Service Provider & Data Center


Current forecasts estimate that the trend will be to manage energy consump- 
tion and efficiency policies based on different types of traffic. Two organizations 
pursuing this goal are introduced next. 


Global e-Sustainability Initiative (GeSI) [2]. In collaboration with members
from major Information and Communication Technology (ICT) companies and
organisations around the globe, GeSI is a leading source of impartial
information, resources and best practices for achieving integrated social and
environmental sustainability through ICT.

In a rapidly growing information society, technology presents both challenges
and opportunities. GeSI facilitates real-world solutions to real-world issues both
within the ICT industry and the greater sustainability community. It
contributes to a sustainable future, communicates the industry’s corporate
responsibility efforts, and increasingly drives the sustainability agenda.

Members and Partners: ATT, Telecom Italia, Ericsson, KPN, Microsoft, 
Nokia, Nokia Siemens. 


GreenTouch [3]. GreenTouch is a consortium of leading Information and
Communications Technology (ICT) industry, academic and non-governmental 
research experts dedicated to fundamentally transforming communications and 
data networks, including the Internet, and significantly reducing the carbon foot- 
print of ICT devices, platforms and networks. 


4.2 C-RAN: Access Network Architecture of Future 5G Networks 


Cloud computing represents a paradigm shift in the evolution of ICT and has 
quickly become a key technology for offering new and improved services to con- 
sumers and businesses. Massive data centers, consisting of thousands of con- 
nected servers, are fundamental functional building blocks in the implementa- 
tion of cloud services. With the rapidly increasing adoption of cloud computing, 
the technology has faced many new challenges related to scalability, high
capacity/reliability demands and energy efficiency. At the same time, the huge
increase in processing capacity allows control decisions to be based on more
accurate information. This justifies the development of much more advanced
control methods and algorithms, which is the objective of the work described
earlier in Sect. 3.

To address the growing challenges, the research community has proposed 
several architectures for data centers, including FatTree, DCell, FiConn, Scafida 
and JellyFish [7]. On the other hand, vendors such as Google, Amazon and Apple
have been developing their own proprietary data center solutions, which has
created interoperability problems between service providers.
To push forward the development of architectures addressing these challenges
and to enable better interoperability between cloud service providers, IEEE has
launched the IEEE Cloud Computing Initiative, which is currently developing
two standards in the area: IEEE P2301 Draft Guide for Cloud Portability and 
Interoperability Profiles and IEEE P2302 Draft Standard for Intercloud Inter- 
operability and Federation. 

Cloud-based approaches are also considered as part of the development of 
the future 5G networks. Namely, in the C-RAN (Cloud-Radio Access Network) 
architecture [35] the radio access network functionality is moved to the cloud. 
This means that all the radio resource management and cell coordination related 
functionality requiring complex computations are implemented in the cloud. This 
makes the functionality of the base stations simpler and hence also cheaper to 
manufacture. However, this places tough requirements on the computing capac- 
ity and efficiency of the centralized processing unit, essentially a data center, and 
the interconnection network between the base stations and the data center. Several
projects based on the C-RAN architecture have been initiated in the Next
Generation Mobile Networks (NGMN) consortium and EU FP7 [10], and the 
C-RAN architecture will most likely be considered also in the standardization 
by 3GPP. 


4.3 A Framework for Energy-Aware Network Management Systems 


Considering the problem modeling and the existing optimization frameworks 
described in the previous sections, the challenge now is the integration of energy 
consumption in network controlling mechanisms. The networks in data centers
and in the operators’ world are evolving fast, with growing size and complexity
that should be tackled through the increased flexibility offered by
softwarization techniques.

Emerging 5G networks now exhibit extensive softwarization of all network
elements: IoT, mobile, and fiber optics-based transport core. These functions
should be integrated in a network management environment with autonomous
or semi-autonomous control response capabilities based on defined SLAs,
applying policies and using simulated scenarios and learning from past history.

Fig. 5. MAPE-K diagram.

Fig. 6. Functional description of an energy management and monitoring application.

By monitoring the energy parameters of radio access networks, fixed networks,
fronthaul and backhaul elements, together with the VNFs supporting the internal
network processes, and by estimating energy consumption and triggering
reactions, the energy footprint of the network (especially backhaul and
fronthaul) can be reduced while maintaining QoS for each VNO or end user. An
Energy Management and Monitoring Application can be conveniently deployed
alongside a standard ETSI MANO and collect energy-specific parameters such as
power consumption and CPU loads (see Figs. 5 and 6). Such an application can
also collect information about several network aspects such as traffic routing
paths, traffic load levels, user throughput and number of sessions, radio
coverage, interference of radio resources, and equipment activation intervals.
All these data can be used to compute a virtual infrastructure energy budget
to be used for subsequent analyses and reactions using machine learning
and optimization techniques [11].
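
As an illustration of how collected CPU loads could be turned into an energy
budget, the following minimal sketch uses a linear power model. The model, the
function names and the sample layout are assumptions made for this example;
they are not prescribed by ETSI MANO or by [11].

```python
def node_power(cpu_load, p_idle, p_max):
    """Linear power model [W]: idle power plus a load-proportional part."""
    if not 0.0 <= cpu_load <= 1.0:
        raise ValueError("cpu_load must be in [0, 1]")
    return p_idle + (p_max - p_idle) * cpu_load


def energy_budget_wh(samples, interval_s):
    """Energy in Wh over a set of monitoring samples.

    samples: iterable of (cpu_load, p_idle, p_max) tuples, one per node and
    per monitoring interval; interval_s: length of one interval in seconds.
    """
    joules = sum(node_power(*s) * interval_s for s in samples)
    return joules / 3600.0


if __name__ == "__main__":
    # Two hypothetical nodes observed for one hour each.
    samples = [(0.5, 100.0, 200.0), (0.0, 50.0, 80.0)]
    print(energy_budget_wh(samples, interval_s=3600.0), "Wh")
```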

The application can optimally schedule the power operational states and the
levels of power consumption of network nodes, jointly performing load balancing
and frequency bandwidth assignment in a highly heterogeneous environment.
The re-allocation of virtual functions across backhaul and fronthaul will also
be part of the optimization actions, in order to move virtual network
functions to less power-consuming or less-loaded servers, thus reducing the
overall energy demand of the network.

Designing software systems that have to deal with dynamic operating con- 
ditions, such as changing availability of resources and faults that are difficult 
to predict, is complex. A promising approach to handle such dynamics is
self-adaptation, which can be realized by a Monitor-Analyze-Plan-Execute plus
Knowledge (MAPE-K) feedback loop. To provide evidence that the system goals are
satisfied regardless of the changing conditions, the state of the art advocates
the use of formal methods.
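
The structure of a MAPE-K loop can be sketched in a few lines. The example
below is a toy illustration only: the node representation, the load thresholds
and the sleep/wake actions are hypothetical and stand in for whatever the real
monitoring and control interfaces provide.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared knowledge base: policy thresholds and observation history."""
    low_load: float = 0.2    # below this, a running node may be put to sleep
    high_load: float = 0.8   # above this, a sleeping node is woken up
    history: list = field(default_factory=list)


def monitor(nodes):
    """Collect the current offered load of each node."""
    return {name: state["load"] for name, state in nodes.items()}


def analyze(loads, k):
    """Classify each node against the thresholds stored in the knowledge."""
    k.history.append(loads)
    return {n: "low" if v < k.low_load else "high" if v > k.high_load else "ok"
            for n, v in loads.items()}


def plan(symptoms, nodes):
    """Translate symptoms into a list of (node, action) pairs."""
    actions = []
    for name, s in symptoms.items():
        if s == "low" and nodes[name]["on"]:
            actions.append((name, "sleep"))
        elif s == "high" and not nodes[name]["on"]:
            actions.append((name, "wake"))
    return actions


def execute(actions, nodes):
    """Apply the planned actions to the managed nodes."""
    for name, action in actions:
        nodes[name]["on"] = (action == "wake")


def mape_k_step(nodes, k):
    """One pass through Monitor, Analyze, Plan and Execute."""
    actions = plan(analyze(monitor(nodes), k), nodes)
    execute(actions, nodes)
    return actions
```

In a real system each phase would be a separate component and the knowledge
base would also hold SLAs and learned models; the sketch only shows how the
four phases share state through the knowledge element.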

Research in progress [1] tries to reinforce the approach of consolidating design
knowledge of self-adaptive systems with the traditional tools of SLAs and
policy modules, and in particular with the need to define the decision criteria
using formalized templates and to make them understandable for a human operator
or manager via the human interfaces and dashboards shown in Fig. 6.
This figure shows the proposed architecture of an advanced Network Monitoring
and Management System that includes energy management.

At the top are the two agents responsible for the management of the network:
on one side, those responsible for negotiating the SLAs with the customers and
for establishing the operation policies; on the other side, those responsible
for the detailed technical operation. These roles are supported by a set of
applications that reside in the corresponding specialized cloud environments.
The Business Intelligence cloud helps the Management API to generate dashboards
for optimizing the business results of the operation, issuing recommendations
to the managers or autonomously implementing decisions. Those decisions will be
based dynamically on contractual commitments, market conditions and customers’
needs.

The operational cloud supports the technical operations with specialized
technical AI dashboards using available information from many sources: network
monitoring information, including real and historical performance data from the
network, power data and network statistics. Simulated data can be used to
support the operation by providing hypothetical failure scenarios, possible
solutions and the impact of applying those solutions. Together with the
historical data, this helps to analyze the consequences of possible decisions
when trying to solve specific incidents. As in the business application, the
operational cloud will analyse the scenarios and select the optimal
configuration, either autonomously or mediated by operator interaction via the
corresponding dashboards, reducing the total energy footprint of the network.

At the bottom of Fig. 6 is the SDN network controller, with access to Network
and Resource Monitoring and Topology, which reacts to real network events by
implementing the required network solution as directed by the layers above.


5 Conclusions and Foreseen Future Research Lines 


5.1 Conclusions 


This chapter has addressed the challenges of combining energy and QoS/QoE 
issues in the management mechanisms of network and cloud services. Unfor- 
tunately, these design parameters are usually conflicting and it is necessary to 
introduce multi-criteria optimization techniques in order to achieve the required 
trade-off solution. 

So, as a first step, the common issues related to multi-objective optimization
problems and mechanisms have been described. These issues include typical
preference articulation mechanisms, typical optimization methods and fairness
considerations. Then, the best-known optimization methods have been briefly
summarized in order to provide the Internet of Services research community with a
broad set of tools for properly addressing the inherent multi-criteria problems.

Finally, in the multiuser/multiservice environments considered in ACROSS,
how resources are distributed and the impact on different kinds of users must
be carefully addressed. As analyzed, fairness is most often considered only once
the algorithm has selected the most efficient (i.e., optimal) solution. However,
the incompatibility between fairness and efficiency is not a design problem of the
optimization algorithms, but of the formulation of the problem to be optimized,
where the fairness concept must be included.

The next step is to model and analyze the problem of including the
performance/energy trade-off in different scenarios within the scope of the
ACROSS project. We start this analysis by studying the use case of data centers
modeled as queuing systems to develop policies for the optimal control of the
QoS/QoE-energy balance. The trend in this area is to focus on small/moderate
size data centers.

Then, the second scenario focuses on the way different applications use the
resources available in cloud environments and its impact in terms of energy
cost. Considering that increasing resources does not always benefit the
performance of the system, we analyze two application-level approaches in order
to improve energy efficiency: the characterization of the application execution
patterns and the shared access to resources.

Next, we show an example of energy-aware load balancing in 5G HetNets 
where cells of different sizes are used to adapt the coverage to the variations of 
user data traffic. We discuss the challenge of designing a load-balancing algorithm 
that considers both the performance of the system and the energy consumption 
of the whole system. The discussion suggests an MDP approach for the
multi-criteria optimization problem.

The last analyzed scenario presents a network services provider that shares 
resources among different traffic flows. The goal here is to introduce energy and 
cost into opportunistic QoE-aware scheduling. The research focuses on the use of
the MDP framework to model the scheduling problem and the application of the
Gittins and Whittle methods to obtain scheduling index rule solutions.

Finally, the chapter compiles the current state of emerging technologies and 
foreseen solutions to the energy/performance trade-off issue in network and cloud
management systems addressed in the ACROSS project. Based on the expected 
huge increase of network traffic and, in consequence, of energy consumption, the 
design of upcoming network management systems must face the challenge of 
addressing power efficiency while still meeting the KPIs of the offered services. 
Industry is already fostering innovative initiatives to integrate energy issues into 
network controlling mechanisms. 

In this direction, we present the C-RAN architecture as the cloud-based solution
for the future 5G access network. This approach moves all the radio resource
management and cell coordination functionality to the cloud. The increasing
complexity of service management and orchestration in the cloud requires
advanced network control methods and algorithms. Therefore, as a final
conclusion, we suggest a framework to include energy awareness in network
management systems that implements a MAPE-K feedback loop.


5.2 Future Work


The joint research accomplished in the scope of the COST ACROSS action has
allowed the identification of common interests to develop in future collaborations.
Remaining under the umbrella of energy/cost-aware network management, this
future work will rely strongly on the application of multi-criteria optimization
techniques in order to cope with conflicting performance objectives.

As previously concluded, the consideration of fairness in an optimization
process does not fall to the multi-criteria optimization algorithm. On the
contrary, it must be considered in the formulation of the design problem itself.
Therefore, one of the issues that will be addressed in future work grounded in
the results of the COST ACROSS action is the inclusion of fairness among
users/services/resource allocation in the definition of network and services
management optimization problems.

Besides, having analyzed the problem of introducing energy-awareness in
load balancing processes in 5G HetNets, another of the proposed future research
lines is to use MDP and Policy Iteration in order to optimize the dispatching
problem, focusing on small/moderate size data centers. Similarly, we also found
common interests in the further analysis of index rule techniques
in the multi-criteria problem of opportunistic QoE-aware scheduling.

Finally, research in progress envisages innovative initiatives to integrate 
energy issues into network controlling mechanisms and interactive management 
approaches including self-adaptation features.


Abstract. The chapter summarizes the activities of the COST IC1304
ACROSS European Project related to traffic management for
Cloud Federation (CF). In particular, we provide a survey of CF
architectures and standardization activities. We present a comprehensive
multi-level model for traffic management in CF that consists of five levels:
Level 5 - Strategies for building CF, Level 4 - Network for CF, Level 3 - Service
specification and provision, Level 2 - Service composition and orchestration,
and Level 1 - Task service in cloud resources. For each level we
propose specific methods and algorithms. The effectiveness of these solutions
was verified by simulation and analytical methods. Finally, we also
describe a specialized simulator for testing CF solutions in an IoT environment.


Keywords: Cloud federation · Traffic management · Multi-layer model ·
Service provision · Service composition


1 Introduction 


Cloud Federation (CF) extends the concept of cloud computing systems by
merging a number of clouds into one system. Thanks to this, CF has the potential
to offer better service to clients than a separate cloud can, since CF has more
resources and may offer a wider scope of services. On the other hand, the
management of CF is more complex than that required for a standalone cloud. So,
the effective management of resources and services in CF is the key to
obtaining additional profit from such a system. CF is a system composed of a
number of clouds connected by a network, as illustrated in Fig. 1. The main
concept of CF is to operate as one computing system with resources distributed
among particular clouds.

Fig. 1. Exemplary CF consisting of 5 clouds connected by network.


In this chapter we present a multi-level model for traffic management in CF. 
Each level deals with a specific class of algorithms, which together should
provide satisfactory service to the clients while maintaining optimal resource
utilization.

The structure of the chapter is the following. In Sect. 2 we discuss
CF architectures and the current state of standardization. The proposed
multi-level model for traffic management in CF is presented in Sect. 3. Section 4
describes a simulation tool for analyzing the performance of CF in an Internet
of Things (IoT) environment. Finally, Sect. 5 summarizes the chapter.


2 Cloud Federation Architectures 


2.1 Cloud Architectural Views 


In general, CF is envisaged as a distributed, heterogeneous environment
consisting of various cloud infrastructures, aggregating the capabilities of
different Infrastructure as a Service (IaaS) providers, possibly from both the
commercial and the academic areas. Nowadays, cloud providers operate
geographically diverse data centers, as user demands like disaster recovery and
multi-site backups have become widespread. These techniques are also used to
avoid provider lock-in
issues for users that frequently utilize multiple clouds. Various research communi- 
ties and standardization bodies defined architectural categories of infrastructure 
clouds. A current EU project on “Scalable and secure infrastructures for cloud 
operations” (SSICLOPS, www.ssiclops.eu) focuses on techniques for the manage- 
ment of federated private cloud infrastructures, in particular cloud networking 
techniques within software-defined data centers and across wide-area networks.
The scope of the SSICLOPS project includes high cloud computing workloads 
e.g. within the CERN computing cloud (home.cern/about/computing) as well 
as cloud applications for securing web access under challenging demands for low 
delay. An expert group set up by the European Commission published their view 
on Cloud Computing in [1]. The report categorizes cloud architectures into five
groups.


— Private Clouds consist of resources managed by an infrastructure provider 
that are typically owned or leased by an enterprise from a service provider. 
Usually, services with cloud-enhanced features are offered, therefore this group 
includes Software as a Service (SaaS) solutions like eBay. 

— Public Clouds offer their services to users outside of the company and may 
use cloud functionality from other providers. In this solution, enterprises can 
outsource their services to such cloud providers mainly for cost reduction. 
Examples of these providers are Amazon or Google Apps. 

— Hybrid Clouds consist of both private and public cloud infrastructures to 
achieve a higher level of cost reduction through outsourcing by maintaining 
the desired degree of control (e.g., sensitive data may be handled in private 
clouds). The report states that hybrid clouds are rarely used at the moment. 

— In Community Clouds, different entities contribute with their (usually small) 
infrastructure to build up an aggregated private or public cloud. Smaller enter- 
prises may benefit from such infrastructures, and a solution is provided by 
Zimory. 

— Finally, Special Purpose Clouds provide more specialized functionalities with 
additional, domain specific methods, such as the distributed document man- 
agement by Google’s App Engine. This group is an extension or a specializa- 
tion of the previous cloud categories. 


The third category, hybrid clouds, is also referred to as cloud federation
in the literature. Many research groups have tried to grasp the essence of
federation formation. In general, cloud federation refers to a mesh of cloud providers
that are interconnected based on open standards to provide a universal decen- 
tralized computing environment where everything is driven by constraints and 
agreements in a ubiquitous, multi-provider infrastructure. Until now, the cloud 
ecosystem has been characterized by the steady rising of hundreds of indepen- 
dent and heterogeneous cloud providers, managed by private subjects, which 
offer various services to their clients. 

Buyya et al. [2] envisioned Cloud Computing as the fifth utility by satisfy- 
ing the computing needs of everyday life. They emphasized and introduced a 
market-oriented cloud architecture, then discussed how global cloud exchanges 
could take place in the future. They further extended this vision suggesting a 
federation oriented, just in time, opportunistic and scalable application services 
provisioning environment called InterCloud. They envision utility oriented fed- 
erated IaaS systems that are able to predict application service behavior for 
intelligent down- and up-scaling of infrastructures. They list the following
research issues: flexible service-to-resource mapping, user- and resource-centric
Quality of Service (QoS) optimization, integration with in-house systems of
enterprises, and scalable monitoring of system components. They present a market-oriented approach to
offer InterClouds including cloud exchanges and brokers that bring together pro- 
ducers and consumers. Producers are offering domain specific enterprise Clouds 
that are connected and managed within the federation with their Cloud Coor- 
dinator component. 

Celesti et al. [3] proposed an approach for the federation establishment con- 
sidering generic cloud architectures according to a three-phase model, represent- 
ing an architectural solution for federation by means of a Cross-Cloud Federation 
Manager, a software component in charge of executing the three main function- 
alities required for a federation. In particular, the component explicitly manages: 


1. the discovery phase in which information about other clouds is received
and sent,

2. the match-making phase performing the best choice of the provider according 
to some utility measure and 

3. the authentication phase creating a secure channel between the federated
clouds.

These concepts can be extended by taking into account green policies
applied in federated scenarios.


Bernstein et al. [4] define two use case scenarios that exemplify the problems
of multi-cloud systems:


1. Virtual Machines (VM) mobility where they identify the networking, the spe- 
cific cloud VM management interfaces and the lack of mobility interfaces as 
the three major obstacles and 

2. storage interoperability and federation scenario in which storage provider
replication policies are subject to change when a cloud provider initiates
subcontracting.

They offer interoperability solutions only for low-level functionality
of the clouds; these are not focused on recent user demands but on solutions
for IaaS system operators.


In the Federated Cloud Management solution [5], interoperability is achieved
by high-level brokering instead of bilateral resource renting. This does
not mean that different IaaS providers may not share or rent resources, but if
they do so, it is transparent to their higher-level management. Such a federation
can be enabled without applying an additional software stack for providing
low-level management interfaces. The logic of federated management is moved to
higher levels, and there is no need for the participating infrastructure
providers to adopt interoperability standards, which is usually a restriction that some
industrial providers are reluctant to undertake.


2.2 Standardization for Cloud Federation 


Standardization related to clouds, cloud interoperability and federation has 
been conducted by the ITU (International Telecommunication Union) [6],
IETF (Internet Engineering Task Force) [7], NIST (National Institute of
Standards and Technology) [8] and IEEE (Institute of Electrical and Electronics
Engineers) [9]. In 2014, the ITU released standard documents on the vocabu- 
lary, a reference architecture and a framework of inter-cloud computing. The 
latter provides an overview, functional requirements and refers to a number of 
use cases. The overview distinguishes between: 


— Inter-cloud Peering: between a primary and secondary CSP (i.e. Cloud Service 
Provider), where cloud services are provided by the primary CSP who estab- 
lishes APIs (application programming interfaces) in order to utilize services 
and resources of the secondary CSP, 

— Inter-cloud Intermediary: as an extension of inter-cloud peering including a set 
of secondary CSPs, each with a bilateral interface for support of the primary 
CSP which offers all services provided by the interconnected clouds, and 

— Inter-cloud Federation: which is based on a set of peer CSPs interconnected 
by APIs as a distributed system without a primary CSP with services being 
provided by several CSPs. For each service, the inter-cloud federation may act 
as an inter-cloud intermediary with a primary CSP responsible for the service. 
The user population may also be subdivided and attributed to several CSPs. 


The main functional requirements to set up and operate a cloud federation 
system are: 


— Networking and communication between the CSPs, 

— Service level agreement (SLA) and policy negotiations, 

— Resource provisioning and discovery mechanisms, 

— Resource selection, monitoring and performance estimation mechanisms,

— Cloud service switch over between CSPs.


Finally, the ITU [6] takes a number of use cases into account to be addressed
by cloud interconnection and federation approaches:


— Performance guarantee against an abrupt increase in load (offloading), 

— Performance guarantee regarding delay (optimization for user location), 

— Guaranteed availability in the event of a disaster or large-scale failure, 

— Service continuity (in the case of service termination of the original CSP), 
service operation enhancement and broadening service variety, 

— Expansion and distribution of cloud storage, media and virtual data center, 

— Market transactions in inter-cloud intermediary pattern and cloud service 
rebranding. 


The standardization on cloud federation has many aspects in common with 
the interconnection of content delivery networks (CDN). A CDN is an infras- 
tructure of servers operating on application layers, arranged for the efficient 
distribution and delivery of digital content mostly for downloads, software 
updates and video streaming. The CDN interconnection (CDNI) working group
of the IETF provided informational RFC documents on the problem
statement, framework, requirements and use cases for CDN interconnection
in a first phase until 2014. Meanwhile, specifications of interfaces between
upstream/downstream CDNs, including redirection of users between CDNs, have
been issued on the proposed standards track [7]. CDNs can be considered as a
special case of clouds with the main purpose of distributing or streaming large
data volumes within a broader service portfolio of cloud computing applications.
The underlying distributed CDN architecture is also useful for large clouds and 
cloud federations for improving the system scalability and performance. This is 
reflected in a collection of CDNI use cases which are outlined in RFC 6770 [7] 
in the areas of: 


— footprint extension, 

— offloading, 

— resilience enhancement, 

— capability enhancements with regard to technology, QoS/QoE support, the 
service portfolio and interoperability. 


The CDNI concept is foreseen as a basis for CDN federations, where a fed- 
eration of peer CDN systems is directly supported by CDNI. A CDN exchange 
or broker approach is not included but can be built on top of core CDNI
mechanisms.

In 2013, NIST [8] published a cloud computing standards roadmap includ- 
ing basic definitions, use cases and an overview on standards with focus on 
cloud/grid computing. Gaps are identified with conclusions on priorities for ongo- 
ing standardization work. However, a recently started standards activity by the 
IEEE [9] towards intercloud interoperability and federation is still motivated by 
today’s landscape of independent and incompatible cloud offerings in proprietary 
as well as open access architectures. 


3 Multi-level Model for Traffic Management in Cloud 
Federation 


The development of efficient traffic engineering methods for Cloud Federation is
essential in order to offer services to clients at an appropriate quality level
while maintaining high utilization of resources. These methods deal with issues
such as the distribution of resources in CF, the design of the network connecting
particular clouds, service provision, the handling of service requests coming from
clients and the management of the virtual resource environment. The proposed traffic
management model for CF consists of 5 levels, as depicted in Fig. 2. Below we
briefly discuss the objectives of each level of the model.

Level 5: This is the highest level of the model, which deals with the rules
for merging particular clouds into a CF. One addressed issue is, e.g., the
amount of resources that particular clouds delegate to CF. We
assume that the main reason for forming a federation is to obtain more profit
compared to the situation in which particular clouds work alone. So, this level deals
with the conditions under which CF can be an attractive solution for cloud owners even
if particular clouds differ in their capabilities, e.g. in the amount of resources, the
client population and the service request rate submitted by the clients.

Fig. 2. Traffic management model for Cloud Federation

Level 4: This level deals with the design of the CF network that connects
particular clouds. Such a network should be of adequate quality and, if possible,
its transfer capabilities should be controlled by the CF network manager. The
addressed issues are the required link capacities between particular clouds and the
effective utilization of network resources (transmission links). We assume that the
network capabilities should provide adequate quality for the services offered by CF
even when the resources allocated to a given service (e.g. virtual machines) come
from different clouds. Effective design of the network in question is especially
important when CF uses a network provided by a network operator on the basis of an SLA
(Service Level Agreement) and, as a consequence, has limited possibilities to
control the network. Currently, such a solution is common practice.

Level 3: This level is responsible for handling requests for service
installation in CF. The installation of a new service requires: (1) specification
of the service and (2) provision of the service. The specification of the service is
provided in the form of a definition of the task sequence that is executed in
CF when a client asks for execution of this service. The provision of
the service corresponds to the allocation of the resources on which particular tasks
can be executed.

Level 2: This level deals with service composition and orchestration
processes. Here, the previously specified sequence of tasks is executed in response
to service requests. The service composition time should meet the user quality
expectations corresponding to the requested service.

Level 1: The last and lowest level deals with task execution in cloud
resources in the case when more than one task is delegated at the same time to
be served by a given resource. Appropriate scheduling mechanisms should therefore be
applied in order to provide, e.g., fairness of task execution. In addition, an
important issue is to understand the dependencies between different types of resources
in a virtualized cloud environment.


3.1 Level 5: Strategy for Cloud Resource Distribution in Federation 


3.1.1 Motivation and State of the Art 

Cloud Federation is a system built on top of a number of clouds.
Such a system should provide additional profit for each cloud owner in
comparison to a stand-alone cloud. In this section we focus on strategies by which
clouds can form a federation to obtain maximum profit, assuming that the profit is
equally shared among the cloud owners.

Unfortunately, only a few publications deal with the discussed problem.
For instance, in [10] the authors consider the effectiveness of different federation
schemes using the M/M/1 queueing system to model a cloud. They assume that the
profit obtained from a task execution depends on the waiting time (reflecting the
received QoS) of this task. Furthermore, they consider scenarios in which the profit is
maximized from the perspective of the whole CF, and scenarios in which each cloud
maximizes its own profit. Another approach is presented in [11], where the author
applies game theory to analyze the selfish behavior of a cloud owner selling unused
resources depending on uncertain load conditions.


3.1.2 Proposed Model 

In the presented approach we assume that the capacity of each cloud is characterized
in terms of the number of resources and the service request rate. Furthermore,
for the sake of simplicity, it is assumed that both the types of resources and the
executed services are the same in each cloud. In addition, each service
is executed by a single resource only. Finally, we model each cloud by the well-known
loss queueing system M/M/c/c (e.g. [12]), where c denotes the number of
identical cloud resources, service requests arrive according to a Poisson process
with rate $\lambda$, and the service time follows a negative exponential
distribution with rate 1/h (h is the mean service time). The performance
of the cloud system is measured by: (1) $P_{loss}$, which denotes the loss rate due
to the lack of available resources at the moment of service request arrival, and (2)
$A_{carried} = \lambda h (1 - P_{loss})$, which denotes the traffic carried by the cloud
and corresponds directly to the resource utilization ratio.
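The two performance measures of the loss system can be evaluated numerically. The following is a minimal Python sketch, assuming the standard Erlang B recursion for the M/M/c/c system; the function names `erlang_b` and `carried_traffic` are illustrative and do not come from the original text.

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Loss probability of an M/M/c/c system (Erlang B), computed recursively."""
    blocking = 1.0  # with zero servers every request is lost
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def carried_traffic(arrival_rate: float, mean_service_time: float, servers: int) -> float:
    """Carried traffic: offered load multiplied by (1 - loss probability)."""
    offered_load = arrival_rate * mean_service_time
    return offered_load * (1.0 - erlang_b(offered_load, servers))


if __name__ == "__main__":
    # Hypothetical cloud: 10 resources, request rate 10, mean service time 1.
    rate, h, c = 10.0, 1.0, 10
    print(f"loss rate   = {erlang_b(rate * h, c):.4f}")
    print(f"utilization = {carried_traffic(rate, h, c) / c:.4f}")
```

The recursion is used instead of the closed form with factorials because it stays numerically stable for large numbers of resources.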
Now, let us search for an appropriate scheme for building a CF system. For
this purpose, let us consider a number, say N, of clouds that intend to build a
CF, where the i-th cloud (i = 1, ..., N) is characterized by two parameters
($\lambda_i$ and $c_i$). In addition, the mean service times are the same
in each cloud: $h_1 = h_2 = ... = h_N = h$. Subsequently we assume that h = 1,
and as a consequence the offered load $A = \lambda h$ will be denoted as $A = \lambda$. Next,
the assumed objective function for comparing the discussed schemes for CF is
to maximize the profit coming from the utilization of the resources delegated by each
cloud to CF. Furthermore, the profit is equally shared among the clouds participating in
CF. Such an approach appears reasonable (at least as a first approach) since
otherwise we would have to take into account which cloud a given request comes from
and which resource (from which cloud) was chosen to serve the request.

We consider three schemes: 


— Scheme no. 1 (see Fig. 3): this is the reference scheme when the clouds work 
alone, denoted by SC. 

— Scheme no. 2 (see Fig. 4): this scheme is named full federation and assumes
that all clouds dedicate all their resources and clients to the CF system. We
denote this scheme as FC.

— Scheme no. 3 (see Fig. 5): for this scheme we assume that each cloud can
delegate to CF only a part of its resources as well as a part of the service requests
coming from its clients. We name this scheme PFC (partial CF).


First, let us compare the performance of the SC and FC schemes in terms of
resource utilization ratio and service request loss rate. The first observation
is that the FC scheme has lower loss probabilities as well as a better resource
utilization ratio due to the larger number of resources. The open question, however,
is how to share the profit gained from the FC scheme when the clouds differ in their
capabilities. Table 1 shows exemplary results for the case when the profit, which
is a consequence of better resource utilization, is shared equally among the clouds.
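The comparison of the SC and FC schemes can be reproduced with a few lines of code. The sketch below assumes that each stand-alone cloud and the fully federated system are both evaluated with the Erlang B formula, with mean service time 1; the function names and the example rates (2, 4, 6, 8, 10 with 10 resources per cloud) are illustrative choices.

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Loss probability of an M/M/c/c system (Erlang B recursion)."""
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def compare_sc_fc(rates, resources):
    """Return per-cloud (utilization, loss) for SC and one (utilization, loss) for FC."""
    sc = []
    for rate, c in zip(rates, resources):
        loss = erlang_b(rate, c)
        sc.append((rate * (1.0 - loss) / c, loss))
    # Full federation: all requests share all resources in one loss system.
    total_rate, total_c = sum(rates), sum(resources)
    fc_loss = erlang_b(total_rate, total_c)
    fc = (total_rate * (1.0 - fc_loss) / total_c, fc_loss)
    return sc, fc


if __name__ == "__main__":
    sc, fc = compare_sc_fc([2.0, 4.0, 6.0, 8.0, 10.0], [10] * 5)
    for i, (util, loss) in enumerate(sc, start=1):
        print(f"cloud {i}: SC utilization {util:.3f}, SC loss {100 * loss:.2f}%")
    print(f"FC: utilization {fc[0]:.3f}, loss {100 * fc[1]:.2f}%")
```

In the FC case every cloud sees the same utilization and loss rate, because the federation behaves as one pooled loss system.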

The results from Table 1 show that, as expected, the FC scheme assures a lower
service request loss rate and a better resource utilization ratio for most of the clouds
(except cloud no. 1, which is underloaded). Note that if we share the profit equally,
the clouds with a smaller service request rate can receive more profit from the FC
scheme compared to the SC scheme, while the clouds with a higher service request
rate get less profit compared to the SC scheme. So, one can conclude that the FC
scheme is an optimal solution when the capabilities of the clouds are similar, but if
they differ essentially then this scheme simply fails.

Fig. 3. Scenario with clouds working in separate way (cloud no. i, i = 1, ..., N, has service request rate $\lambda_i$ and number of resources $c_i$)

Fig. 4. Scenario with clouds creating Cloud Federation based on full federation scheme

Scheme no. 3 mitigates the drawbacks of schemes no. 1 and no. 2. As
stated above, in this scheme we assume that each cloud can delegate to CF
only a part of its resources as well as a part of the service request rate submitted by
its clients. The main assumptions for the PFC scheme are the following:


Table 1. Exemplary results comparing SC and FC schemes in terms of loss rate and
resource utilization parameters. Number of clouds N = 5, values of $\lambda$: $\lambda_1$ = 0.2, $\lambda_2$ =
0.4, $\lambda_3$ = 0.6, $\lambda_4$ = 0.8, the same mean service times $h_1 = h_2 = h_3 = h_4 = h_5 = 1$,
number of resources in each cloud: $c_1 = c_2 = c_3 = c_4 = c_5 = 10$.

| No. | Service request rate | Number of resources | SC: resource utilization | SC: loss rate [%] | FC: resource utilization | FC: loss rate [%] |
|-----|----------------------|---------------------|--------------------------|-------------------|--------------------------|-------------------|
| 1   | 2                    | 10                  | 0.2                      | <0.01             | 0.6                      | 0.02              |
| 2   | 4                    | 10                  | 0.398                    | 0.54              | 0.6                      | 0.02              |
| 3   | 6                    | 10                  | 0.575                    | 4.3               | 0.6                      | 0.02              |
| 4   | 8                    | 10                  | 0.703                    | 12                | 0.6                      | 0.02              |
| 5   | 10                   | 10                  | 0.786                    | 21                | 0.6                      | 0.02              |


1. we split the resources belonging to the i-th cloud (i = 1, ..., N), say $c_i$, into 2
main subsets:
— set of private resources that are delegated to handle only service requests
coming from the i-th cloud clients
— set of resources dedicated to Cloud Federation for handling service requests
coming from all clouds creating Cloud Federation, denoted as $c_{i3}$
2. we again split the private resources into two categories:
— belonging to the 1st category, denoted as $c_{i1}$, which are dedicated as the
first choice to handle service requests coming from the i-th cloud clients
— belonging to the 2nd category, denoted as $c_{i2}$, which are dedicated to
handle service requests coming from the i-th cloud clients that were not
served by resources from the 1st category or from the common pool because
all these resources were occupied.


The following relationship holds:

$c_i = c_{i1} + c_{i2} + c_{i3}$, for i = 1, ..., N. (1)


The handling of service requests in the PFC scheme is shown in Fig. 5. The
service requests from clients belonging, e.g., to cloud no. i (i = 1, ..., N) are
submitted as the first choice to be handled by private resources belonging to the
1st category. When these resources are currently occupied, the
second choice is the resources belonging to the common pool. The number of
common pool resources equals ($c_{13} + c_{23} + ... + c_{N3}$). If these resources
are also occupied, the final choice is the resources belonging to the
2nd category of private resources of the considered cloud. The service requests
are finally lost if no resources are available in this pool either.
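The three-stage overflow order described above can be sketched as a small dispatch routine. This is a simplified illustration in Python, not the authors' implementation; the class `Pool`, the function `dispatch` and the returned labels are hypothetical names, and resource release after service completion is omitted.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A group of identical resources with a simple occupancy counter."""
    capacity: int
    busy: int = 0

    def try_seize(self) -> bool:
        """Occupy one resource if any is free; report whether it succeeded."""
        if self.busy < self.capacity:
            self.busy += 1
            return True
        return False


def dispatch(private_first: Pool, common: Pool, private_second: Pool) -> str:
    """Route one request of a given cloud: 1st category, common pool, 2nd category."""
    for label, pool in (
        ("private-1", private_first),
        ("common", common),
        ("private-2", private_second),
    ):
        if pool.try_seize():
            return label
    return "lost"  # all three groups are fully occupied


if __name__ == "__main__":
    first, shared, second = Pool(1), Pool(1), Pool(1)
    print([dispatch(first, shared, second) for _ in range(4)])
```

Note that the common pool object would be shared by all clouds, while the two private pools belong to a single cloud.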

Next, we show how to determine the number of resources belonging to particular
clouds in order to get maximum profit (equally shared between the cloud owners).
We stress that the following conditions should be satisfied when designing the size of
the common pool:

Condition 1: the service request rate (offered load) submitted by particular clouds
to the common pool should be the same. It means that

$P_{loss1}(\lambda_1, c_{11})\lambda_1 = P_{loss2}(\lambda_2, c_{21})\lambda_2 = ... = P_{lossN}(\lambda_N, c_{N1})\lambda_N$ (2)

where the value of $P_{lossi}(\lambda_i, c_{i1})$ is calculated from the analysis of the system
M/M/n/n by using the Erlang formula:

$P_{lossi}(\lambda_i, c_{i1}) = \frac{\lambda_i^{c_{i1}} / c_{i1}!}{\sum_{j=0}^{c_{i1}} \lambda_i^{j} / j!}$

Note that we only require that the mean traffic load submitted from each cloud to the
common pool should be the same. Let us note that the service request arrival
processes from each cloud submitted to this pool are generally different. This is
due to the fact that these requests were not served by the 1st category of private
resources and, as a consequence, they are no longer Poissonian.
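For numerical evaluation, the Erlang formula is commonly computed with an equivalent recursion that avoids large factorials. The recursion below is a standard identity rather than part of the original text; $B(A, n)$ is shorthand introduced here for the loss probability with offered load $A$ and $n$ resources, so that $P_{lossi}(\lambda_i, c_{i1}) = B(\lambda_i, c_{i1})$ when h = 1.

```latex
B(A, 0) = 1, \qquad
B(A, n) = \frac{A \, B(A, n-1)}{n + A \, B(A, n-1)}, \quad n = 1, 2, \dots, c_{i1}
```

In particular, $B(A, 0) = 1$ means that a cloud with no 1st category resources forwards its whole offered load to the common pool.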

Condition 2: the number of resources dedicated from each cloud to the common
pool should be the same:

$c_{13} = c_{23} = ... = c_{N3}$.

Fig. 5. Handling of service requests in PFC scheme.


Finally, the algorithm for calculating the resource distribution for each cloud is
the following:

Step 1: order the $\lambda_i$ (i = 1, ..., N) values from minimum to maximum.
Let the k-th cloud have the minimum value of $\lambda$.

Step 2: calculate (using Formula 2) for each cloud the number of resources
delegated to category 1 of private resources, $c_{i1}$ (i = 1, ..., N),
assuming that $c_{k1} = 0$.

Step 3: choose the minimum value from the set of ($c_i - c_{i1}$) (i = 1, ..., N) and
state that each cloud should delegate this number of resources to the common
pool. Let us note that if for the i-th cloud the value of ($c_i - c_{i1}$) < 0, then no
common pool can be set and, as a consequence, the conditions for
Cloud Federation are not satisfied.

Step 4: calculate from Formula 1 the number of 2nd category private
resources $c_{i2}$ (i = 1, ..., N) for each cloud.
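Steps 1 to 4 can be sketched in Python as follows. Because the number of resources is an integer, Condition 1 can only be met approximately; this sketch assumes that $c_{i1}$ is chosen as the integer whose overflow load is closest to the smallest request rate. That rounding rule, like the function names, is an assumption made for illustration and is not stated in the original text.

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Loss probability of an M/M/c/c system (Erlang B recursion)."""
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def first_category_size(rate: float, target: float) -> int:
    """Integer c_i1 whose overflow load rate * P_loss is closest to the target."""
    n = 0
    while rate * erlang_b(rate, n) > target:  # overflow decreases as n grows
        n += 1
    if n > 0:
        below = abs(rate * erlang_b(rate, n) - target)
        above = abs(rate * erlang_b(rate, n - 1) - target)
        if above < below:
            n -= 1
    return n


def pfc_resource_split(rates, resources):
    """Return (c_i1 list, c_i2 list, common-pool share per cloud)."""
    target = min(rates)  # Step 1: the least loaded cloud k gets c_k1 = 0
    first = [first_category_size(rate, target) for rate in rates]  # Step 2
    spare = [c - c1 for c, c1 in zip(resources, first)]
    if min(spare) < 0:  # Step 3: no common pool can be set
        raise ValueError("conditions for Cloud Federation are not satisfied")
    share = min(spare)
    second = [c - c1 - share for c, c1 in zip(resources, first)]  # Step 4
    return first, second, share


if __name__ == "__main__":
    rates = [7.5, 8.4, 8.4, 9.3, 9.3, 10.2, 10.2, 11.1, 11.1, 12.0]
    print(pfc_resource_split(rates, [10] * 10))
```

With these example rates and 10 resources per cloud, every cloud contributes the same number of resources to the common pool, as required by Condition 2.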


3.1.3 Exemplary Results 

Now we present some exemplary numerical results showing the performance of the
described schemes. The first observation is that when the size of the common pool
grows, the profit we can get from Cloud Federation also grows.

Example: In this example we have 10 clouds that differ in service request
rates, while the number of resources in each cloud is the same and equal to
10. Table 2 presents the numerical results corresponding to traffic conditions,
number of resources and performance of the systems built under the SC and PFC
schemes. The required amounts of resources belonging to particular categories
were calculated with the algorithm described above.

Table 2 shows that thanks to the PFC scheme we extend the volume of
served traffic from 76.95 up to 84.50 (about 10%). The next step to increase
Cloud Federation performance is to apply the FC scheme instead of the PFC scheme.


Table 2. Numerical results showing comparison between SC and PFC schemes.

| No.   | Service request rate | Number of resources | SC: load served by cloud | SC: loss rate [%] | L1   | L2 | L3 | L4 | L5    | L6    | L7   | L8    | L9  | L10  |
|-------|----------------------|---------------------|--------------------------|-------------------|------|----|----|----|-------|-------|------|-------|-----|------|
| 1     | 7.5                  | 10                  | 6.75                     | 10                | 7.50 | 0  | 5  | 5  | 0.00  | 2.34  | 4.82 | 7.16  | 3.5 | 0.41 |
| 2     | 8.4                  | 10                  | 7.22                     | 14                | 7.50 | 1  | 4  | 5  | 0.89  | 2.10  | 4.82 | 7.82  | 6.3 | 0.60 |
| 3     | 8.4                  | 10                  | 7.22                     | 14                | 7.50 | 1  | 4  | 5  | 0.89  | 2.10  | 4.82 | 7.82  | 6.3 | 0.60 |
| 4     | 9.3                  | 10                  | 7.61                     | 18                | 7.50 | 2  | 3  | 5  | 1.79  | 1.75  | 4.82 | 8.35  | 10  | 0.74 |
| 5     | 9.3                  | 10                  | 7.61                     | 18                | 7.50 | 2  | 3  | 5  | 1.79  | 1.75  | 4.82 | 8.35  | 10  | 0.74 |
| 6     | 10.2                 | 10                  | 7.91                     | 22                | 7.50 | 3  | 2  | 5  | 2.69  | 1.26  | 4.82 | 8.77  | 14  | 0.86 |
| 7     | 10.2                 | 10                  | 7.91                     | 22                | 7.50 | 3  | 2  | 5  | 2.69  | 1.26  | 4.82 | 8.77  | 14  | 0.86 |
| 8     | 11.1                 | 10                  | 8.17                     | 26                | 7.50 | 4  | 1  | 5  | 3.58  | 0.68  | 4.82 | 9.08  | 19  | 0.91 |
| 9     | 11.1                 | 10                  | 8.17                     | 26                | 7.50 | 4  | 1  | 5  | 3.58  | 0.68  | 4.82 | 9.08  | 19  | 0.91 |
| 10    | 12                   | 10                  | 8.38                     | 30                | 7.50 | 5  | 0  | 5  | 4.49  | 0.00  | 4.82 | 9.31  | 23  | 0.92 |
| Total | 97.5                 | 100                 | 76.95                    |                   | 75   | 25 | 25 | 50 | 22.39 | 13.91 | 48.2 | 84.50 |     | 7.55 |

Columns L1 to L10 refer to the PFC scheme:
L1: offered load to common pool
L2: number of the 1st category of private resources
L3: number of the 2nd category of private resources
L4: number of resources delegated to common pool
L5: load served by the 1st category of private resources
L6: load served by the 2nd category of private resources
L7: load served by common pool of resources
L8: total load served by clouds
L9: loss rate [%]
L10: load served gain comparing to SC scheme
Unfortunately, this cannot be done in a straightforward way. It requires
moving resources or service request rates between particular clouds. Table 3
presents the moving of service request rates in the considered example to transform
the PFC scheme into the FC scheme. For instance, cloud no.
1 should buy a service request rate of 2.25, while cloud no. 10 should sell
a service request rate of 2.25. Finally, after the buying/selling process,
one can observe that the profit gained from the FC scheme is greater than the profit
obtained from the PFC scheme and is now equal to 91.50 (19% more than in the SC
scheme and 8% more than in the PFC scheme).
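The buying and selling of service request rates can be illustrated with a short calculation. The sketch below assumes that the transfers simply equalize the request rate of every cloud to the mean rate and that the resulting full federation is evaluated as one Erlang loss system with mean service time 1; the function names are hypothetical.

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Loss probability of an M/M/c/c system (Erlang B recursion)."""
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def equalizing_transfers(rates):
    """Positive value: rate the cloud sells; negative value: rate it buys."""
    mean_rate = sum(rates) / len(rates)
    return [rate - mean_rate for rate in rates]


def fc_carried_load(rates, resources):
    """Total load served when all requests share all resources."""
    total_rate = sum(rates)
    return total_rate * (1.0 - erlang_b(total_rate, sum(resources)))


if __name__ == "__main__":
    rates = [7.5, 8.4, 8.4, 9.3, 9.3, 10.2, 10.2, 11.1, 11.1, 12.0]
    for i, delta in enumerate(equalizing_transfers(rates), start=1):
        action = "sells" if delta > 0 else "buys"
        print(f"cloud {i} {action} {abs(delta):.2f}")
    print(f"load served under FC: {fc_carried_load(rates, [10] * 10):.2f}")
```

After the transfers every cloud offers the same load, so equal sharing of the profit no longer penalizes the more heavily loaded clouds.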

Concluding, the presented approach for modeling different cloud federation
schemes such as FC and PFC can only be applied for setting preliminary rules
for establishing CF. Still, it appears that in some cases, when using the simple FC
scheme, we may expect problems with sharing the profit among CF owners.
More precisely, some cloud owners may lose profit while others gain profit compared
to the case when their clouds work alone. Of course, a more detailed model of CF
is strongly required that also takes into account such characteristics as the types of
offered services, prices of resources, charging, control of service requests, etc.


Table 3. Example showing system transformation into FC scheme.

| No.   | Service request rate | Number of resources | Service request rate to sell | Service request rate to buy | L1   | L2   | L3  | L4   | L5   |
|-------|----------------------|---------------------|------------------------------|-----------------------------|------|------|-----|------|------|
| 1     | 7.5                  | 10                  | 0                            | 2.25                        | 9.75 | 9.15 | 6.2 | 9.09 | 9.01 |
| 2     | 8.4                  | 10                  | 0                            | 1.35                        | 9.75 | 9.15 | 6.2 | 9.09 | 9.01 |
| 3     | 8.4                  | 10                  | 0                            | 1.35                        | 9.75 | 9.15 | 6.2 | 9.05 | 8.97 |
| 4     | 9.3                  | 10                  | 0                            | 0.45                        | 9.75 | 9.15 | 6.2 | 9.05 | 8.97 |
| 5     | 9.3                  | 10                  | 0                            | 0.45                        | 9.75 | 9.15 | 6.2 | 9.01 | 8.93 |
| 6     | 10.2                 | 10                  | 0.45                         | 0                           | 9.75 | 9.15 | 6.2 | 9.01 | 8.93 |
| 7     | 10.2                 | 10                  | 0.45                         | 0                           | 9.75 | 9.15 | 6.2 | 8.96 | 8.89 |
| 8     | 11.1                 | 10                  | 1.35                         | 0                           | 9.75 | 9.15 | 6.2 | 8.96 | 8.89 |
| 9     | 11.1                 | 10                  | 1.35                         | 0                           | 9.75 | 9.15 | 6.2 | 8.92 | 8.85 |
| 10    | 12                   | 10                  | 2.25                         | 0                           | 9.75 | 9.15 | 6.2 | 9.15 | 9.15 |
| Total | 97.5                 | 100                 | 5.85                         | 5.85                        | 97.5 | 91.5 |     | 91.5 | 91.5 |

Columns L1 to L5 refer to the FC scheme:
L1: offered load to common pool
L2: load served by common pool of resources
L3: loss rate [%]
L4: load served gain comparing to PFC scheme
L5: load served gain comparing to SC scheme

3.2 Level 4: Network for Cloud Federation


3.2.1 Motivation and State of the Art 

The services offered by CF use resources provided by multiple clouds with data
centers in different locations. Therefore, CF requires an efficient, reliable and
secure inter-cloud communication infrastructure. This infrastructure is especially
important for mission-critical and interactive services that have strict
QoS requirements. Currently, CF commonly exploits the Internet for inter-cloud
communication, e.g. CONTRAIL [13]. Although this approach may be sufficient
for non-real-time services, e.g., distributed file storage or data backups,
it inhibits the deployment of more demanding services such as augmented or virtual
reality, video conferencing, on-line gaming, real-time data processing in distributed
databases or live video streaming. The commonly used approach for ensuring the
required QoS level is to exploit SLAs between clouds participating in CF. These
SLAs are established on demand during the service provisioning process (see
Level 3 of the model in Fig. 2) and use network resources coming from network
providers. However, independently established SLAs lead to inefficient utilization
of network resources, raise scalability concerns and increase the operating
expenditures (OPEX) paid by CF. These negative effects become critical for
large CFs with many participants as well as for large cloud providers offering
a plethora of services. For example, the recent experience of Google cloud
points out that using independent SLAs between data centers is ineffective [14].
Therefore, Google has created its own communication infrastructure that can be
optimized and dynamically reconfigured following the demands of currently offered
services, planned maintenance operations as well as restoration actions taken to
overcome failures.


3.2.2 Proposed Solution 

The proposed approach for CF is to create, manage and maintain a Virtual Net- 
work Infrastructure (VNI), which provides communication services tailored for 
inter-cloud communication. The VNI is shared among all clouds participating in 
CF and is managed by CF orchestration and management system. Actually, VNI 
constitutes a new “service component” that is orchestrated during service provi- 
sioning process and is used in service composition process. The key advantages 
of VNI are the following: 


1. The common orchestration of cloud and VNI resources enables optimization 
of service provisioning by considering network capabilities. In particular, CF 
can benefit from advanced traffic engineering algorithms taking into account 
knowledge about service demands and VNI capabilities, including QoS guar- 
antees and available network resources. The objective function of designed 
algorithms may cover efficient load balancing or maximization and fair share 
of the CF revenue. 

2. New communication facilities tailored for cloud services: 

— The cloud services significantly differ in QoS requirements, e.g. interactive
services are delay sensitive, while video on demand or big data storage
demands more bandwidth. Therefore, VNI should differentiate packet service
and provide QoS guarantees following user’s requirements. The key
challenge is to design a set of Classes of Services (CoS) adequate for handling
traffic carried by the federation. These CoSs are considered in the service
orchestration process.

— The VNI should offer multi-path communication facilities that support
multicast connections and multi-site backups and make communication
effective in multi-tenancy scenarios. The key challenge is to develop scalable
routing and forwarding mechanisms able to support a large number of
multi-site communications.


The VNI is created following the Network as a Service (NaaS) paradigm
based on resources provided by clouds participating in CF. Each cloud should
provide: (1) a virtual network node, which is used to send, receive or transit packets
directed to or coming from other clouds, and (2) a number of virtual links
established between peering clouds. These links are created based on SLAs agreed
with network provider(s). The VNI exploits the advantages of the Software Defined
Networking (SDN) concept supported by network virtualization techniques. This
makes it feasible to separate network control functions from the underlying physical
network infrastructure. In our approach, CF defines its own traffic control and
management functions that operate on an abstract model of VNI. The management
focuses on adaptation of the VNI topology, provisioning of resources allocated
to virtual nodes and links, traffic engineering, and cost optimization. On the
other hand, this VNI model is used during the service composition phase for
dynamic resource allocation, load balancing, cost optimization, and other short
time scale operations. Finally, decisions taken by VNI control functions on the
abstract VNI model are translated into configuration commands specific to the
particular virtual node.


(a) Communication based on SLA peering. (b) Communication based on VNI.


Fig. 6. Two reference network scenarios considered for CF. 




Figure 6 shows the reference network scenarios considered for CF. Figure 6a
presents the scenario where CF exploits only direct communication between
peering clouds. In this scenario, the role of CF orchestration and management is
limited to dynamic updates of SLAs between peering clouds. Figure 6b presents
the scenario where CF creates a VNI using virtual nodes provided by clouds and
virtual links provided by network operators. The CF orchestration and management
process uses a VNI controller to set up and release flows, perform traffic
engineering, and maintain the VNI (updating the VNI topology, provisioning
virtual links).


The Control Algorithm for VNI. The VNI is controlled and managed by a
specialized CF network application running on the VNI controller. This
application is responsible for handling flow setup and release requests received
from the CF orchestration and management process as well as for performing
commonly recognized network management functions related to configuration,
provisioning and maintenance of VNI. The flow setup requires a specialized
control algorithm, which decides whether to accept or reject an incoming flow
request. The admission decision is taken based on the traffic descriptor, the
requested class of service, and information about available resources on routing
paths between source and destination. In order to efficiently exploit network
resources, CF uses multi-path routing, which allows allocating bandwidth between
any pair of network nodes up to the available capacity of the minimum cut of the
VNI network graph. Thanks to a logically centralized VNI architecture, CF may
exploit different multi-path routing algorithms, e.g. [15,16]. We propose a new
k-shortest path algorithm which considers multi-criteria constraints during
calculation of alternative k-shortest paths to meet the QoS objectives of the
classes of service offered in CF.
We model VNI as a directed graph $G(N, E)$, where $N$ represents the set of
virtual nodes provided by particular clouds, while $E$ is the set of virtual
links between peering clouds. Each link $u \to v$, $u, v \in N$, $u \to v \in E$,
is characterized by an $m$-dimensional vector of non-negative link weights
$w(u \to v) = [w_1, w_2, \ldots, w_m]$, which relates to the QoS requirements of
services offered by CF. Any path $p$ established between two nodes is
characterized by a vector of path weights
$w(p) = [w_1(p), w_2(p), \ldots, w_m(p)]$, where $w_i(p)$ is calculated as a
concatenation of the link weights $w_i$ of each link belonging to the path $p$.
The proposed multi-criteria, k-shortest path routing algorithm finds a set of
Pareto optimum paths, $f \in F$, between each pair of source to destination
nodes. A given path is Pareto optimum if its path weights satisfy the
constraints $w_i(f) < l_i$, $i = 1, \ldots, m$, where
$L = [l_1, l_2, \ldots, l_m]$ is the vector of assumed constraints, and it is
non-dominated within the scope of the considered objective functions. Note that
the proposed multi-criteria, k-shortest path routing algorithm runs off-line as
a sub-process in the CF network application. It is invoked in response to any
change in the VNI topology: instantiation or release of a virtual link or node,
detection of a link or node failure, or an update of SLAs.
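The Pareto condition above can be illustrated with a minimal sketch. It is not the authors' k-shortest path algorithm: it only filters an already enumerated list of candidate paths. It assumes that link weights are additive along a path (one possible reading of "concatenation"), and all names (`path_weights`, `pareto_feasible`, the example graph) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Link = Tuple[str, str]
Path = Tuple[str, ...]


def path_weights(path: Path, w: Dict[Link, Sequence[float]]) -> Tuple[float, ...]:
    """Path weight vector w(p); ASSUMPTION: each criterion is additive over links."""
    links = list(zip(path, path[1:]))
    m = len(w[links[0]])
    return tuple(sum(w[link][i] for link in links) for i in range(m))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every criterion and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_feasible(paths: List[Path], w: Dict[Link, Sequence[float]],
                    limits: Sequence[float]) -> List[Path]:
    """Keep paths with w_i(f) < l_i for all i that no other feasible path dominates."""
    scored = [(p, path_weights(p, w)) for p in paths]
    feasible = [(p, wp) for p, wp in scored
                if all(x < l for x, l in zip(wp, limits))]
    return [p for p, wp in feasible
            if not any(dominates(wq, wp) for _, wq in feasible)]


# Hypothetical example: weights are (delay, cost) per virtual link.
example_weights = {
    ("A", "B"): (1, 5),
    ("A", "C"): (2, 1), ("C", "B"): (2, 1),
    ("A", "D"): (1, 6), ("D", "B"): (1, 6),
}
example_paths = [("A", "B"), ("A", "C", "B"), ("A", "D", "B")]
print(pareto_feasible(example_paths, example_weights, limits=(5, 20)))
```

In this example the path via D satisfies the constraints but is dominated by the direct link, so only the direct path and the path via C remain.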

The VNI control algorithm is invoked when a flow request arrives from the
CF orchestration process. The algorithm is responsible for: (1) selection of a
subset of feasible alternative routing paths which satisfy the QoS requirements
of the requested flow (note that the bandwidth requested in the traffic
descriptor may be satisfied by a number of alternative paths, assuming flow
splitting among them), (2) allocation of the flow to the selected feasible
alternative routing paths, and (3) configuration of flow tables in virtual nodes
on the selected path(s). The main objective of the proposed VNI control
algorithm is to maximize the number of requests that are served successfully.
This goal is achieved through a smart allocation algorithm which uses network
resources efficiently. Note that the flow allocation problem is NP-complete,
while the allocation algorithm has to take its decision in a relatively short
time (on the order of seconds) so as not to exceed the tolerable request
processing time. This limitation calls for a heuristic algorithm that finds a
feasible solution in a reasonable time, although the selected solution may not
be the optimal one.
The proposed VNI control algorithm performs the following steps: 


1. Create a decision space. In this step the algorithm creates a subset of
feasible alternative paths that meet the QoS requirements from the set of
k-shortest routing paths. The algorithm matches the QoS requirements with the
path weights w(p). Then, it checks whether the selected subset of feasible
alternative paths can meet the bandwidth requirements, i.e. whether the sum of
available bandwidth on disjoint paths is greater than the requested bandwidth.
Finally, the algorithm returns the subset of feasible paths if the request is
accepted, or returns the empty set Ø, which results in flow rejection.

2. Allocate flow in VNI. In this step, the algorithm allocates the flow to the
previously selected subset of feasible paths. The allocation may address
different objectives, e.g. load balancing or keeping the flow on a single path,
depending on the CF strategy and policies. In the proposed algorithm, we
allocate the requested flow on the shortest paths, using as few alternative
paths as possible. So, we first try to allocate the flow on the least loaded
shortest path. If there is not enough bandwidth to satisfy the demand, we divide
the flow over other alternative paths following load balancing principles. If we
still need more bandwidth to satisfy the request, we consider longer alternative
paths in consecutive steps. The process finishes when the requested bandwidth is
allocated.

3. Configure flow tables. In the final step, the VNI control algorithm
configures the allocated paths using the abstract model of VNI maintained in the
SDN controller. The actual configuration is performed by the management system
of the particular cloud using e.g. the OpenFlow protocol, NETCONF or another
protocol.
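Steps 1 and 2 above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation. It assumes that candidate paths are disjoint (so their available bandwidths add up), that "least loaded" means "most available bandwidth", that path length is measured in hops, and that load balancing splits the flow in proportion to available bandwidth. Step 3 (flow table configuration) is omitted, and all names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Path = Tuple[str, ...]


def create_decision_space(candidates: List[Tuple[Path, Sequence[float]]],
                          qos_limits: Sequence[float],
                          available: Dict[Path, float],
                          demand: float) -> List[Path]:
    """Step 1: keep paths whose weights meet the QoS limits; reject the request
    (empty list) unless their summed available bandwidth exceeds the demand."""
    feasible = [p for p, wp in candidates
                if all(x < l for x, l in zip(wp, qos_limits))]
    if sum(available[p] for p in feasible) > demand:
        return feasible
    return []


def allocate_flow(feasible: List[Path], available: Dict[Path, float],
                  demand: float) -> Dict[Path, float]:
    """Step 2: fill the shortest paths first, moving to longer ones only if needed."""
    by_length: Dict[int, List[Path]] = {}
    for p in feasible:
        by_length.setdefault(len(p), []).append(p)
    allocation: Dict[Path, float] = {}
    remaining = demand
    for length in sorted(by_length):
        group = by_length[length]
        least_loaded = max(group, key=lambda p: available[p])
        if available[least_loaded] >= remaining:
            # A single path can carry what is left: keep the flow together.
            allocation[least_loaded] = remaining
            return allocation
        capacity = sum(available[p] for p in group)
        if capacity == 0:
            continue
        # ASSUMPTION: split proportionally to available bandwidth.
        share = min(remaining, capacity)
        for p in group:
            allocation[p] = share * available[p] / capacity
        remaining -= share
        if remaining <= 0:
            break
    return allocation


# Hypothetical example with a single QoS criterion (delay).
candidates = [(("A", "B"), (10,)), (("A", "C", "B"), (25,)), (("A", "D", "B"), (30,))]
available = {("A", "B"): 4.0, ("A", "C", "B"): 6.0, ("A", "D", "B"): 2.0}
paths = create_decision_space(candidates, (50,), available, demand=8.0)
print(allocate_flow(paths, available, demand=8.0))
```

With a demand of 8 units, the direct path is filled first and the remainder is placed on a single two-hop path rather than being spread over both.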


3.2.3 Performance Evaluation 

The experiments focus on performance evaluation of the proposed VNI control
algorithm. They are performed assuming a model of CF comprising n clouds
offering the same set of services. The CF network assumes a full mesh topology
where peering clouds are connected by virtual links. In this model the number
of degrees of freedom in selecting alternative paths is relatively large. Our
experiments are performed by simulation. We simulate the flow request arrival
process and analyze the system performance in terms of request blocking
probabilities. We analyze the effectiveness of the VNI control algorithm with
respect to: (1) the number of alternative paths established in VNI, and (2)
balanced and unbalanced load conditions. Note that the results for a single
path, denoted as 1 path, correspond to the strategy of using only direct virtual
links between peering clouds, while the other cases exploit the multi-path
routing capabilities offered by VNI.

Fig. 7. Blocking probabilities of flow requests served by VNI using different
numbers of alternative paths.

Figure 7 presents exemplary results showing request blocking probabilities as
a function of offered load, obtained for VNI using different numbers of
alternative paths. Figure 7a corresponds to balanced load conditions, where
each source to destination relation in the network is equally loaded. Figure 7b
shows blocking probabilities for extremely unbalanced load conditions, where
flows are established for a single chosen relation only. One can observe that
using VNI instead of direct communication between peering clouds significantly
decreases blocking probabilities over a wide range of offered load, up to the
working point limit at the assumed blocking probability level of 0.1. One can
also observe that using alternative paths significantly increases the carried
traffic at the same blocking probability. Moreover, most of the gain comes from
adding the first alternative path. Increasing the number of alternative paths
above four or five yields practically no further improvement. The gain becomes
especially significant under unbalanced load conditions.


3.3 Level 3: Service Provision 


Motivation. While traditionally a cloud infrastructure is located within a
data-center, recently there is a need for geographical distribution [17]. For
instance, cloud federation can combine the capabilities of multiple cloud
offerings in order to satisfy the user’s response time or availability
requirements. Lately, this need for geo-distribution has led to a new evolution
of decentralization. Most notably, the extension of cloud computing towards the
edge of the enterprise network is generally referred to as fog or edge computing
[18]. In fog computing, computation is performed at the edge of the network at
the gateway devices, reducing bandwidth requirements, latency, and the need for
communicating data to the servers. Going one step further, mist computing pushes
processing even further to the network edge, involving the sensor and actuator
devices [19].

Compared to a traditional cloud computing environment, a geo-distributed 
cloud environment is less well-controlled and behaves in an ad-hoc manner. 
Devices may leave and join the network, or may become unavailable due to 
unpredictable failures or obstructions in the environment. 

Additionally, while in a data-center heterogeneity is limited to multiple
generations of servers being used, there is a large spread in capabilities
within a geo-distributed cloud environment. Memory and processing means range from
high (e.g. servers), over medium (e.g. cloudlets, gateways) to very low (e.g. 
mobile devices, sensor nodes). While some communication links guarantee a 
certain bandwidth (e.g. dedicated wired links), others provide a bandwidth with 
a certain probability (e.g. a shared wired link), and others do not provide any 
guarantees at all (wireless links). 

Reliability is an important non-functional requirement, as it outlines how
a software system realizes its functionality [20]. The unreliability of
substrate resources in a heterogeneous cloud environment severely affects the
reliability of the applications relying on those resources. Therefore, it is
very challenging to host reliable applications on top of unreliable
infrastructure [21].

Moreover, traditional cloud management algorithms cannot be applied here,
as they generally consider powerful, always-on servers, interconnected over
wired links. Many algorithms do not even take into account bandwidth
limitations. While such an omission can be justified by an appropriately
over-provisioned network bandwidth within a data-center, it is not warranted in
the geo-distributed cloud networks described above.


State of the Art. In this section, the state of the art with regard to the 
Application Placement Problem (APP) in cloud environments is discussed. Early 
work on application placement merely considers nodal resources, such as Central 
Processing Unit (CPU) and memory capabilities. Deciding whether requests are 
accepted and where those virtual resources are placed then reduces to a Multiple 
Knapsack Problem (MKP) [22]. An MKP is known to be NP-hard and therefore 
optimal algorithms are hampered by scalability issues. A large body of work has 
been devoted to finding heuristic solutions [23-25]. 
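To make the knapsack view concrete, the following minimal sketch places requests with a first-fit decreasing rule. It is a generic textbook-style heuristic chosen for illustration, not the method of any of the cited works, and it considers nodal CPU and memory only. All names and numbers are hypothetical.

```python
from typing import Dict, Tuple


def first_fit_decreasing(requests: Dict[str, Tuple[int, int]],
                         nodes: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Place each (cpu, mem) request on the first node with enough free resources.

    Requests are handled from largest to smallest; a request that fits nowhere
    is rejected, i.e. it is absent from the returned placement.
    """
    free = {n: list(capacity) for n, capacity in nodes.items()}
    placement: Dict[str, str] = {}
    ordered = sorted(requests.items(), key=lambda kv: kv[1], reverse=True)
    for name, (cpu, mem) in ordered:
        for node, (free_cpu, free_mem) in free.items():
            if cpu <= free_cpu and mem <= free_mem:
                free[node][0] -= cpu
                free[node][1] -= mem
                placement[name] = node
                break
    return placement


# Hypothetical example: two nodes, four requests given as (CPU, memory).
nodes = {"n1": (4, 8), "n2": (2, 4)}
requests = {"a": (3, 4), "b": (2, 2), "c": (2, 3), "d": (1, 1)}
print(first_fit_decreasing(requests, nodes))
```

Such a greedy rule runs quickly but gives no optimality guarantee: here request b is rejected even though the total free capacity would have sufficed under a different ordering.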

When the application placement not only decides where computational enti- 
ties are hosted, but also decides on how the communication between those 
entities is routed in the Substrate Network (SN), then we speak of network- 
aware APP. Network-aware application placement is closely tied to Virtual 
Network Embedding (VNE) [26]. An example of a network-aware approach is 
the work from Moens et al. [27]. It employs a Service Oriented Architecture 
(SOA), in which applications are constructed as a collection of communicating 
services. This optimal approach performs node and link mapping simultaneously. 
In contrast, other works try to reduce computational complexity by performing
those tasks in distinct phases [28, 29].

While the traditional VNE problem assumes that the SN remains operational at
all times, the Survivable Virtual Network Embedding (SVNE) problem does
consider failures in the SN. For instance, Ajtai et al. try to guarantee that a
virtual network can still be embedded in a physical network after k network
components fail. They provide a theoretical framework for fault-tolerant graphs
[30]. However, in this model, hardware failure can still result in service
outage, as migrations may be required before normal operation can continue.

Mihailescu et al. try to reduce network interference by placing Virtual 
Machines (VMs) that communicate frequently, and do not have anti-collocation 
constraints, on Physical Machines (PMs) located on the same racks [31]. Addi- 
tionally, they uphold application availability when dealing with hardware fail- 
ures by placing redundant VMs on separate server racks. A major shortcoming 
is that the number of replicas to be placed, and the anti-collocation constraints 
are user-defined. 

Csorba et al. propose a distributed algorithm to deploy replicas of VM images 
onto PMs that reside in different parts of the network [32]. The objective is to 
construct balanced and dependable deployment configurations that are resilient. 
Again, the number of replicas to be placed is assumed predefined. 

SiMPLE allocates additional bandwidth resources along multiple disjoint
paths in the SN [33]. This proactive approach assumes splittable flow, i.e. the
bandwidth required for a Virtual Link (VL) can be realized by combining
multiple parallel connections between the two end points. The goal of SiMPLE is
to minimize the total bandwidth that must be reserved, while still guaranteeing
survivability against single link failures. However, an important drawback is
that while the required bandwidth decreases as the number of parallel paths
increases, the probability of more than one path failing goes up exponentially,
effectively reducing the VL’s availability.
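The tradeoff can be made tangible with a short numerical sketch. It rests on two assumptions that are not taken from [33]: that a demand b split over k disjoint paths reserves b/(k-1) on each path, so that any single path failure still leaves b available, and that paths fail independently with the same probability. The function names are hypothetical.

```python
def total_reserved_bandwidth(demand: float, k: int) -> float:
    """Total reservation when each of k disjoint paths carries demand / (k - 1).

    ASSUMPTION: this split is the minimum that survives any single path failure.
    """
    if k < 2:
        raise ValueError("at least two disjoint paths are needed for protection")
    return k * demand / (k - 1)


def prob_multiple_path_failures(p_fail: float, k: int) -> float:
    """Probability that two or more of k independent paths fail."""
    none_fail = (1 - p_fail) ** k
    exactly_one = k * p_fail * (1 - p_fail) ** (k - 1)
    return 1 - none_fail - exactly_one


for k in (2, 3, 5):
    print(k, total_reserved_bandwidth(100.0, k), prob_multiple_path_failures(0.1, k))
```

Under these assumptions the reserved bandwidth falls from twice the demand towards the demand itself as k grows, while the probability of an unprotected double failure rises with every added path.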

Chowdhury et al. propose Dedicated Protection for Virtual Network Embed- 
ding (DRONE) [34]. DRONE guarantees Virtual Network (VN) survivability 
against single link or node failure, by creating two VNEs for each request. These 
two VNEs cannot share any nodes and links. 

The aforementioned SVNE approaches [30-34] lack an availability model. When
the infrastructure is homogeneous, it might suffice to say that each VN or VNE
needs a predefined number of replicas. However, in geo-distributed cloud
environments the resulting availability will largely be determined by the exact
placement configuration, as moving one service from an unreliable node to a more
reliable one can make all the difference. Therefore, geo-distributed cloud
environments require SVNE approaches which have a computational model for
availability as a function of SN failure distributions and placement
configuration.

The following cloud management algorithms have a model to calculate
availability. Jayasinghe et al. model cloud infrastructure as a tree structure
with arbitrary depth [35]. Physical hosts on which VMs are hosted are the leaves
of this tree, while the ancestors comprise regions and availability zones. Wang
et al. were the first to provide a mathematical model to estimate the resulting
availability from such a tree structure [36]. They calculate the availability of
a single VM as the probability that neither the leaf itself, nor any of its
ancestors fail. Their work focuses on handling workload variations by a
combination of vertical and horizontal scaling of VMs. Horizontal scaling
launches or suspends additional VMs, while vertical scaling alters VM
dimensions. The total availability is then the probability that at least one of
the VMs is available. While their model suffices for traditional clouds, it is
ill-suited for a geo-distributed cloud environment as link failure and bandwidth
limitations are disregarded.
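A tree-structured availability estimate of this kind can be sketched with a simple recursion. The sketch assumes independent failures of all tree nodes and is an illustration of the idea as summarized above, not a reproduction of the model in [36]. The tree, the failure probabilities and the function name are hypothetical.

```python
from typing import Dict, List, Set


def tree_availability(node: str, children: Dict[str, List[str]],
                      p_fail: Dict[str, float], vm_hosts: Set[str]) -> float:
    """Probability that at least one VM below `node` is reachable.

    A VM is available only if its host (a leaf) and all ancestors are up.
    ASSUMPTION: all nodes of the tree fail independently.
    """
    up = 1.0 - p_fail[node]
    kids = children.get(node, [])
    if not kids:
        return up if node in vm_hosts else 0.0
    none_below = 1.0
    for child in kids:
        none_below *= 1.0 - tree_availability(child, children, p_fail, vm_hosts)
    return up * (1.0 - none_below)


# Hypothetical tree: one region, two availability zones, three hosts.
children = {"region": ["z1", "z2"], "z1": ["h1", "h2"], "z2": ["h3"]}
p_fail = {"region": 0.01, "z1": 0.02, "z2": 0.02,
          "h1": 0.05, "h2": 0.05, "h3": 0.05}

single = tree_availability("region", children, p_fail, {"h1"})
same_zone = tree_availability("region", children, p_fail, {"h1", "h2"})
two_zones = tree_availability("region", children, p_fail, {"h1", "h3"})
print(single, same_zone, two_zones)
```

In this example, spreading two VMs over different zones yields a higher total availability than placing both in the same zone, because the replicas then share only the region as a common point of failure.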

In contrast, Yeow et al. define reliability as the probability that critical nodes 
of a virtual infrastructure remain in operation over all possible failures [37]. They 
propose an approach in which backup resources are pooled and shared across 
multiple virtual infrastructures. Their algorithm first determines the required 
redundancy level and subsequently performs the actual placement. However, 
decoupling those two operations is only possible when link failure can be omitted 
and nodes are homogeneous. 


Availability Model. In this section we introduce an availability model for
geo-distributed cloud networks, which considers any combination of node and link
failures, and supports both node and link replication. Then, building on this
model, we study the problem of guaranteeing a minimum level of availability for
applications. In the next section, we introduce an Integer Linear Program (ILP)
formulation of the problem. While the ILP solver can find optimal placement
configurations for small scale networks, its computation time quickly becomes
unmanageable when the substrate network dimensions increase. Subsequently, two
heuristics are presented: (1) a distributed evolutionary algorithm employing a
pool-model, where execution of computational tasks and storage of the population
database (DB) are separated, and (2) a fast centralized algorithm, based on
subgraph isomorphism detection. Finally, we evaluate the performance of the
proposed algorithms.


3.3.0.1 Application Requests. We consider a SOA, which is a way of structuring
IT solutions that leverage resources distributed across the network [38]. In
a SOA, each application is described as a composition of services. Throughout
this work, the collected composition of all requested applications will be
represented by the instance matrix ($I$).

Services have certain CPU ($\omega$) and memory requirements ($\gamma$).
Additionally, bandwidth ($\beta$) is required by the VLs between any two
services. A sub-modular approach allows sharing of memory resources amongst
services belonging to multiple applications.


3.3.0.2 Cloud Infrastructure. Consider a substrate network consisting of nodes
and links. Nodes have certain CPU ($\Omega$) and memory capabilities ($\Gamma$).
Physical links between nodes are characterized by a given bandwidth ($B$). Both
nodes and links have a known probability of failure, $p^N$ and $p^E$
respectively. Failures are considered to be independent.

Table 4. Overview of input variables to the Cloud Application Placement Problem
(CAPP).

Symbol | Description
$A$ | Set of requested applications
$S$ | Set of services
$\omega_s$ | CPU requirement of service $s$
$\gamma_s$ | Memory requirement of service $s$
$\beta_{s_1,s_2}$ | Bandwidth requirement between services $s_1$ and $s_2$
$I_{a,s}$ | Instantiation of service $s$ by application $a$: 1 if instanced, else 0
$N$ | Set of physical nodes comprising the substrate network
$E$ | Set of physical links (edges) comprising the substrate network
$\Omega_n$ | CPU capacity of node $n$
$\Gamma_n$ | Memory capacity of node $n$
$p^N_n$ | Probability of failure of node $n$
$B_e$ | Bandwidth capacity of link $e$
$p^E_e$ | Probability of failure of link $e$
$R_a$ | Required total availability of application $a$: lower bound on the probability that at least one of the duplicates for $a$ is available
$\delta$ | Maximum allowed number of duplicates


3.3.0.3 The VAR Protection Method. Availability not only depends on failure
in the SN, but also on how the application is placed. Non-redundant application
placement assigns each service and VL at most once, while its redundant
counterpart can place those virtual resources more than once. The survivability
method presented in this work, referred to as VAR, guarantees a minimum
availability by application level replication, while minimizing the overhead
imposed by allocation of those additional resources. VAR uses a static failure
model, i.e. availability only depends on the current state of the network.
Additionally, it is assumed that upon failure, switching between multiple
application instances takes place without any delay. These separate application
instances will be referred to as duplicates. Immediate switchover yields a good
approximation when the duration of switchover is small compared to the uptime of
individual components. A small switchover time is feasible, given that each
backup service is preloaded in memory, and CPU and bandwidth resources have been
preallocated. Furthermore, immediate switchover allows condensation of the exact
failure dynamics of each component into its expected availability value, as long
as the individual components fail independently (a more limiting assumption).
Fig. 8. Overview of this work: services {$\omega$, $\gamma$, $\beta$}, composing
applications {$I$}, are placed on a substrate network where node {$p^N$} and
link failure {$p^E$} are modeled. By increasing the redundancy $\delta$, a
minimum availability $R$ can be guaranteed.


Table 5. An overview of resource sharing amongst identical services and VLs.

Sharing of resources | CPU | Memory | Bandwidth
Within application | Yes | Yes | Yes
Amongst applications | No | Yes | No


In the VAR model, an application is available if at least one of its duplicates
is on-line. A duplicate is on-line if none of the PMs and Physical Links (PLs)
that contribute to its placement fail. Duplicates of the same application can
share physical components. An advantage of this reuse is that a fine-grained
tradeoff can be made between increased availability and decreased resource
consumption. An overview of resource reuse is shown in Table 5. In Fig. 9,
three possible placement configurations using two duplicates are shown for one
application. In Fig. 9a both duplicates are identical, and no redundancy is
introduced. The nodal resource consumption is minimal, as CPU and memory for
$s_1$, $s_2$, and $s_3$ are provisioned only once. Additionally, the total
bandwidth required for $(s_1, s_2)$ and $(s_2, s_3)$ is only provisioned once.
The bandwidth consumption of this configuration might not be minimal if
consolidation of two or three services onto one PM is possible. This placement
configuration does not provide any fault-tolerance, as failure of either $n_1$,
$n_2$ or $n_3$, or of $(n_1, n_2)$ or $(n_2, n_3)$, results in downtime.

When more than one duplicate is placed and the resulting arrangements of 
VLs and services differ, then the placement is said to introduce redundancy. 
However, this increased redundancy results in a higher resource consumption. 
In Fig. 9b the application survives a singular failure of either $(n_4, n_2)$,
$(n_2, n_3)$, $(n_4, n_5)$, or $(n_5, n_3)$. The placement configuration
depicted in Fig. 9c survives all singular failures in the SN, except for a
failure of $n_1$.
Fig. 9. Illustration of the VAR protection method.
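The availability of an application under the VAR rules stated above (available if at least one duplicate is on-line, a duplicate being on-line if none of its PMs and PLs fail, independent component failures) can be computed by inclusion-exclusion over the duplicates. The following is a minimal sketch of that calculation. The placement used in the example is hypothetical and does not reproduce Fig. 9, and the function name is an assumption.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, List


def var_availability(duplicates: List[FrozenSet[Hashable]],
                     p_fail: Dict[Hashable, float]) -> float:
    """P(at least one duplicate on-line), by inclusion-exclusion.

    Each duplicate is the set of physical components (PMs and PLs) it uses.
    Shared components are counted once via the set union.
    """
    total = 0.0
    for r in range(1, len(duplicates) + 1):
        for subset in combinations(duplicates, r):
            used = frozenset().union(*subset)
            all_up = 1.0
            for component in used:
                all_up *= 1.0 - p_fail[component]
            total += (-1) ** (r + 1) * all_up
    return total


# Hypothetical placement: both duplicates share node n1 but use different
# second nodes and links. Links are written as node pairs.
p_fail = {"n1": 0.01, "n2": 0.01, "n3": 0.01,
          ("n1", "n2"): 0.02, ("n1", "n3"): 0.02}
dup_a = frozenset({"n1", "n2", ("n1", "n2")})
dup_b = frozenset({"n1", "n3", ("n1", "n3")})

print(var_availability([dup_a], p_fail))          # single duplicate
print(var_availability([dup_a, dup_a], p_fail))   # identical duplicates add nothing
print(var_availability([dup_a, dup_b], p_fail))   # partial redundancy
```

The number of terms grows exponentially with the number of duplicates, which is acceptable here only because that number is bounded by a small $\delta$.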


Formal Problem Description. The algorithms presented in this work are 
based on the optimisation model proposed in [39]. In this section we briefly 
describe the model but refer to [39] for a more elaborate discussion. Our model 
consists of two main blocks: the cloud-environment and the set of applications. 
To model the problem we define the following constraints. We refer to [39] for 
the mathematical representation. 


- The total number of duplicates for each application is limited by $\delta$.

- An application a is placed correctly if and only if at least one duplicate of
a is placed.

- A service is correctly placed if there is enough CPU and memory available in
all PMs on which it is placed.

- A service is placed on a PM if and only if it is used by at least one
duplicate.

- The aggregate bandwidth of the VLs that use a PL cannot be higher than the
total bandwidth of the PL.

- A VL can use a PL if and only if the PL has sufficient remaining bandwidth.

- An application is only placed if the availability of the application can be
guaranteed.


If a service is placed on the same PM for multiple duplicates or for multiple
applications, or the same VL is placed on a PL, resources can be reused (see
Table 5). Therefore, if service s is placed twice on PM n for the same
application, there is no need to allocate CPU and memory twice. Only if service
s is placed for a different application must additional CPU resources be
allocated.
The problem we solve is to maximise the number of accepted applications. 
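The sharing rules of Table 5 translate into a simple accounting scheme for nodal resources, sketched below. This is an illustration of the rules as stated in the text, not the ILP formulation of [39]. The data layout and names are hypothetical.

```python
from typing import Dict, List, Tuple

# One entry per placed service instance: (application, duplicate, service, node).
Placement = Tuple[str, int, str, str]


def node_usage(placements: List[Placement], cpu_req: Dict[str, float],
               mem_req: Dict[str, float]):
    """Return (cpu, mem) usage per node under the sharing rules of Table 5.

    CPU is shared between duplicates of one application, but not between
    applications; memory is shared in both cases.
    """
    cpu_keys = {(app, svc, node) for app, _, svc, node in placements}
    mem_keys = {(svc, node) for _, _, svc, node in placements}
    cpu: Dict[str, float] = {}
    mem: Dict[str, float] = {}
    for _, svc, node in cpu_keys:
        cpu[node] = cpu.get(node, 0.0) + cpu_req[svc]
    for svc, node in mem_keys:
        mem[node] = mem.get(node, 0.0) + mem_req[svc]
    return cpu, mem


# Hypothetical example: service s1 placed on node n1 by two duplicates of
# application a1 and by one duplicate of application a2.
placements = [("a1", 1, "s1", "n1"), ("a1", 2, "s1", "n1"), ("a2", 1, "s1", "n1")]
cpu, mem = node_usage(placements, cpu_req={"s1": 2.0}, mem_req={"s1": 4.0})
print(cpu, mem)
```

In the example, CPU is allocated once per application (twice in total), while memory for s1 is allocated only once on n1.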


Results. For a description of the proposed heuristics, and an extensive perfor- 
mance analysis, featuring multiple application types, SN types and scalability 
study we refer the interested reader to [40]. 

In reliable cloud environments (or equivalently, under low availability
requirements) it is often acceptable to place each VN only once, and to
disregard availability [27]. However, when the frequency of failures is higher
(or if availability requirements increase), then one of the following measures
should be taken. First, one can improve the availability by placing additional
backups, which fail independently of one another. However, this approach works
best in homogeneous cloud environments, where one can use the same number of
backup VN embeddings, regardless of the exact placement configuration. In
heterogeneous environments a fixed redundancy level for each application either
results in wasted SN resources, or in a reduced placement ratio. In the context
of cloud federation, the reliability of the links interconnecting the different
cloud entities can be highly heterogeneous (leased lines, or best-effort public
internet). Therefore, to further improve revenue, cloud federation should take
these failure characteristics into consideration, and estimate the required
replication level.


3.4 Level 2: Service Composition and Orchestration 


Service composition and orchestration have become the predominant paradigms
that enable businesses to combine and integrate services offered by third
parties. For the commercial viability of composite services, it is crucial that
they are offered at sharp price-quality ratios. A complicating factor is that
many attractive third-party services often show highly variable service quality.
This raises the need for mechanisms that promptly adapt the composition to
changes in the quality delivered by third party services. In this section, we
discuss a QoS control mechanism that dynamically optimizes service composition
in real time by learning and adapting to changes in third party service response
time behaviors. Our approach combines the power of learning and adaptation with
the power of dynamic programming. The results show that real-time service
re-compositions lead to dramatic cost savings, while meeting the service quality
requirements of the end-users.


3.4.1 Background and Motivation 


In the competitive market of information and communication services, it is cru- 
cial for service providers to be able to offer services at competitive price/quality 
ratios. Succeeding to do so will attract customers and generate business, while 
failing to do so will inevitably lead to customer dissatisfaction, churn and loss of 
business. A complicating factor in controlling quality-of-service (QoS) in service 
oriented architectures is that the ownership of the services in the composition 
(sub-services) is decentralized: a composite service makes use of sub-services 
offered by third parties, each with their own business incentives. As a conse- 
quence, the QoS experienced by the (paying) end user of a composite service 
depends heavily on the QoS levels realized by the individual sub-services run- 
ning on different underlying platforms with different performance characteristics: 
a badly performing sub-service may strongly degrade the end-to-end QoS of a 
composite service. In practice, service providers tend to outsource
responsibilities by negotiating Service Level Agreements (SLAs) with third
parties. However, negotiating multiple SLAs in itself is not sufficient to
guarantee end-to-end QoS levels, as SLAs in practice often give probabilistic
QoS guarantees and SLA violations can still occur. Moreover, probabilistic QoS
guarantees do not necessarily capture time-dependent behavior, e.g. short term
service degradations. Therefore, the negotiation of SLAs needs to be
supplemented with run-time QoS-control capabilities that give providers of
composite services the capability to properly respond to short-term QoS
degradations (real-time composite service adaptation). Motivated by this, in
this section we propose an approach that adapts to (temporary) third party QoS
degradations by tracking the response time behavior of these third party
services.


3.4.2 Literature and Related Work 

The problem of QoS-aware optimal composition and orchestration of composite
services has been well-studied (see e.g. [41,42]). The main problem addressed in
these papers is how to select one concrete service per abstract service for a
given workflow, in such a way that the QoS of the composite service (as
expressed by the respective SLA) is guaranteed, while optimizing some cost
function. Once established, this composition would remain unchanged for the
entire life-cycle of the composite web service. In reality, SLA violations occur
relatively often, leading to providers’ losses and customer dissatisfaction. To
overcome this issue, it is suggested in [43-45] that, based on observations of
the actually realised performance, re-composition of the service may be
triggered. During the re-composition phase, new concrete service(s) may be
chosen for the given workflow. Once the re-composition phase is over, the (new)
composition is used as long as there are no further SLA violations. In
particular, the authors of [43-45] describe when to trigger such a
(re-composition) event, and which adaptation actions may be used to improve
overall performance.

A number of solutions have been proposed for the problem of dynamic, run- 
time QoS-aware service selection and composition within SOA [46-49]. These 
(proactive) solutions aim to adapt the service composition dynamically at
run-time. However, these papers either consider only the expected value of the
response time rather than its stochastic nature, or they do not consider the cost
structure, revenue and penalty model as given here.

In the next section, we extend the approach presented in [48] such that we
can learn and exploit response-time distributions on the fly. The use of classical
reinforcement-learning techniques would be a straightforward approach. However,
our model has a special structure that complicates the use of the classical
Temporal Difference (TD) learning approaches. The solution of our dynamic
programming (DP) formulation searches the stochastic shortest path in a stochastic
activity network [50]. This DP can be characterized as a hierarchical DP [51,52].
Therefore classical Reinforcement Learning (RL) is not suitable and hierarchical RL
has to be applied [52]. In addition, changes in response-time behavior are likely to
occur, which complicates the problem even more. Both the problem structure and
volatility are challenging areas of research in RL. Typically, RL techniques solve
complex learning and optimization problems by using a simulator. This involves a
Q value that assigns utility to state-action combinations. Most algorithms run
off-line, as a simulator is used for optimization. RL has also been widely used in
on-line applications. In such applications, information becomes available gradually
with time. Most RL approaches are based on environments that do not vary over
time. We refer to [51] for a good survey on reinforcement learning techniques.

In our approach we tackle both challenges: the hierarchical structure and the
time-varying behavior. To this end we use empirical distributions and
update the lookup table if significant changes occur. As we consider a
sequence of tasks, the number of possible combinations of response-time
realizations explodes. By discretizing the empirical distribution over fixed
intervals we overcome this issue.
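
The discretization step can be illustrated with a minimal sketch. It assumes response times measured in integer milliseconds, a fixed bin width, and a final overflow bin for everything at or beyond a chosen horizon; the function name, the bin width and the sample values are hypothetical and not taken from [48].

```python
from collections import Counter


def discretize(samples_ms, width_ms, horizon_ms):
    """Turn observed response times into a pmf over fixed-width bins.

    Bin k covers [k*width_ms, (k+1)*width_ms); the last bin collects all
    observations at or beyond horizon_ms (an assumption of this sketch).
    """
    n_bins = horizon_ms // width_ms
    counts = Counter(min(s // width_ms, n_bins) for s in samples_ms)
    total = len(samples_ms)
    return [counts.get(k, 0) / total for k in range(n_bins + 1)]


# Six hypothetical observations, 250 ms bins, horizon of 1 s.
pmf = discretize([120, 180, 310, 330, 950, 2400], width_ms=250, horizon_ms=1000)
```

With a fixed number of bins per service, the state space seen by the dynamic program grows with the number of bins rather than with the number of distinct response-time combinations.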


3.4.3 Composition and Orchestration Model 

We consider a composite service that comprises a sequential workflow consisting 
of N tasks identified by $T_1, \ldots, T_N$. The tasks are executed one-by-one in the
sense that each consecutive task has to wait for the previous task to finish. Our 
solution is applicable to any workflow that could be aggregated and mapped 
into a sequential one. Basic rules for aggregation of non-sequential workflows 
into sequential workflows have been illustrated in, e.g. [48,50,53]. However, the 
aggregation leads to coarser control, since decisions could not be taken for a 
single service within the aggregated workflow, but rather for the aggregated 
workflow patterns themselves. 

The workflow is based on an unambiguous functionality description of a ser- 
vice (“abstract service”), and several functionally identical alternatives (“con- 
crete services”) may exist that match such a description [54]. Each task has an 
abstract service description or interface which can be implemented by external 
service providers. 

The workflow in Fig. 10 consists of four abstract tasks, and each task maps
to three concrete services (alternatives), which are deployed by (independent)
third-party service providers. For each task $T_i$ there are $M_i$ concrete service
providers $CS^{(i,1)}, \ldots, CS^{(i,M_i)}$ available that implement the functionality
corresponding to task $T_i$. For each request processed by $CS^{(i,j)}$ a cost $c^{(i,j)}$
has to be paid. Furthermore, there is an end-to-end response-time deadline $\delta_p$.
If a request is processed within $\delta_p$, a reward of $R$ is received. However, for all
requests that are not processed within $\delta_p$ a penalty $V$ has to be paid. After the
execution of a single task within the workflow, the orchestrator decides on the next
concrete service to be executed, and the composite service provider pays the
third-party provider per single invocation. The decision points for the given tasks
are illustrated in Fig. 10 by A, B, C and D. The decision taken is based on
(1) execution costs, and (2) the remaining time to meet the end-to-end deadline.
The response time of each concrete service provider $CS^{(i,j)}$ is represented by the
random variable $D^{(i,j)}$. After each decision the observed response time is used for
updating the response-time distribution information of the selected service. Upon
each lookup table update the corresponding distribution information is stored as the
reference distribution. After each response the reference distribution is compared
against the current up-to-date response-time distribution information.

In our approach response-time realizations are used for learning and updating
the response-time distributions. The currently known response-time distribution 



Fig. 10. Orchestrated composite web service depicted by a sequential workflow. 
Dynamic run-time service composition is based on a lookup table. Decisions are 
taken at points A-D. For every used concrete service the response-time distribution is 
updated with the new realization. In this example a significant change is detected. As 
a result for the next request concrete service 2 is selected at task 1. 


is compared against the response-time distribution that was used for the last 
policy update. Using well-known statistical tests we are able to identify whether a
significant change has occurred and the policy has to be recalculated. Our approach is
based on fully dynamic, run-time service selection and composition, taking into 
account the response-time commitments from service providers and information 
from response-time realizations. The main goal of this run-time service selection 
and composition is profit maximization for the composite service provider and 
ability to adapt to changes in response-time behavior of third party services. 

By tracking response times the actual response-time behavior can be cap- 
tured in empirical distributions. In [48] we apply a dynamic programming (DP) 
approach in order to derive a service-selection policy based on response-time real- 
izations. With this approach it is assumed that the response-time distributions 
are known or derived from historical data. This results in a so-called lookup table
which determines which third-party alternative should be used based on actual
response-time realizations. 
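
How such a lookup table can be derived is sketched below by backward induction. This is an illustrative reading of the model above, not the exact formulation of [48]. It assumes that response times are discretized into integer time units, that pmfs[i][j][k] is the probability that alternative j of task i takes k units, and that a request which has already missed the deadline still executes the remaining tasks. All names and numbers are hypothetical.

```python
def build_lookup_table(costs, pmfs, deadline, reward, penalty):
    """Return (table, expected_profit); table[i][t] is the alternative to
    invoke for task i when t time units have elapsed so far."""
    n_tasks = len(costs)
    late = deadline + 1  # absorbing state: the deadline has been missed
    # Terminal values: reward if finished within the deadline, else penalty.
    value = [reward] * (deadline + 1) + [-penalty]
    table = [None] * n_tasks
    for i in reversed(range(n_tasks)):
        new_value, choice = [], []
        for t in range(late + 1):
            best_j, best_v = None, float("-inf")
            for j, (cost, pmf) in enumerate(zip(costs[i], pmfs[i])):
                # Expected continuation value minus the invocation cost.
                expected = sum(p * value[min(t + k, late)]
                               for k, p in enumerate(pmf))
                if expected - cost > best_v:
                    best_j, best_v = j, expected - cost
            new_value.append(best_v)
            choice.append(best_j)
        value, table[i] = new_value, choice
    return table, value[0]


# Two tasks, each with a fast/expensive and a slow/cheap alternative.
fast = [0.0, 1.0]            # always 1 time unit
slow = [0.0, 0.5, 0.0, 0.5]  # 1 or 3 time units with equal probability
table, profit = build_lookup_table(
    costs=[[3, 1], [3, 1]], pmfs=[[fast, slow], [fast, slow]],
    deadline=4, reward=10, penalty=5)
```

In this toy instance the table selects the cheap alternative while enough time remains and switches to the fast one once two or more time units have elapsed before the second task, which is the run-time behavior a lookup table is meant to encode.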

3.4.4 Real Time QoS Control

In this section we explain our real-time QoS control approach. The main goal 
of this approach is profit maximization for the composite service provider, and 
ability to adapt to changes in response-time behavior of third party services. We 
realize this by monitoring/tracking the observed response-time realizations. The 
currently known empirical response-time distribution is compared against the 
response-time distribution that was used for the last policy update. Using
well-known statistical tests we are able to identify whether a significant change
has occurred and the policy has to be recalculated. Our approach is based on fully
dynamic, run-time service selection and composition, taking into account the
response-time commitments from service providers and information from response-time
realizations. We illustrate our approach using Fig. 11. The execution starts with
an initial lookup table at step (1). This could be derived from initial measure- 
ments on the system. After each execution of a request in step (2) the empirical 
distribution is updated at step (3). A DP based lookup table could leave out 
unattractive concrete service providers. In that case we do not receive any infor- 
mation about these providers. These could become attractive if the response-time 
behavior changes. Therefore in step (4), if a provider is not visited for a certain 
time, a probe request will be sent at step (5b) and the corresponding empirical 
distribution will be updated at step (6b). After each calculation of the lookup
table, the current set of empirical distributions will be stored. These are the 
empirical distributions that were used in the lookup table calculation and form 
a reference response-time distribution. Calculating the lookup table for every 
new sample is expensive and undesired. Therefore we propose a strategy where 
the lookup table will be updated if a significant change in one of the services 
is detected. For this purpose the reference distribution is used for detection of 
response-time distribution changes. In step (5a) and step (6a) the reference dis- 
tribution and current distribution are retrieved and a statistical test is applied 
for detecting change in the response-time distribution. If no change is detected 
then the lookup table remains unchanged. Otherwise the lookup table is updated 
using the DP. After a probe update in step (5b) and step (6b) we immediately 
proceed to updating the lookup table as probes are sent less frequently. In step 
(7) and step (8) the lookup table is updated with the current empirical distribu- 
tions and these distributions are stored as new reference distribution. By using 
empirical distributions we are directly able to learn and adapt to (temporary)
changes in the behavior of third party services.
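
The change-detection part of this loop can be sketched as follows. The text does not name the statistical test, so the two-sample Kolmogorov-Smirnov test used here is an assumption, as are the 5% significance level (asymptotic coefficient 1.358), the function names and the recompute callback standing in for the DP. A practical implementation would also bound the sample size, e.g. with a sliding window.

```python
import math


def ks_statistic(ref, cur):
    """Largest gap between the empirical CDFs of two samples."""
    ref, cur = sorted(ref), sorted(cur)
    i = j = 0
    d = 0.0
    while i < len(ref) and j < len(cur):
        x = min(ref[i], cur[j])
        while i < len(ref) and ref[i] <= x:
            i += 1
        while j < len(cur) and cur[j] <= x:
            j += 1
        d = max(d, abs(i / len(ref) - j / len(cur)))
    return d


def significant_change(ref, cur, c_alpha=1.358):
    """Asymptotic two-sample KS decision; c_alpha = 1.358 is the 5% level."""
    n, m = len(ref), len(cur)
    return ks_statistic(ref, cur) > c_alpha * math.sqrt((n + m) / (n * m))


def process_response(current, reference, sample, recompute):
    """One pass through steps (3), (5a)/(6a), (7) and (8) for one service."""
    current.append(sample)                      # step (3): update distribution
    if significant_change(reference, current):  # steps (5a)/(6a): test
        recompute(current)                      # step (7): rebuild lookup table
        reference[:] = current                  # step (8): store new reference
        return True
    return False
```

A usage example: starting from identical reference and current samples, repeated slow responses eventually trigger exactly one recalculation, after which the reference distribution equals the current one.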

Using a lookup table based on empirical distributions could result in the situation
that certain alternatives are never invoked. When other alternatives break
down, such an alternative could become attractive. In order to deal with this issue
we use probes. A probe is a dummy request that will provide new information 
about the response time for that alternative. As we only receive updates from 
alternatives which are selected by the dynamic program, we have to keep track 
of how long ago a certain alternative has been used. For this purpose, each
concrete service provider is assigned a probe timer $U^{(i,j)}$ with a corresponding
probe time-out $t_p^{(i,j)}$. If a provider is not visited in $t_p^{(i,j)}$ requests
($U^{(i,j)} > t_p^{(i,j)}$), then the probe timer has expired and a probe will be
collected, incurring probe cost $c_p^{(i,j)}$. If, for example, in Fig. 10, the second
alternative of the third task has not been used in the last ten requests, the probe
timer for this alternative has value $U^{(3,2)} = 10$. After a probe we immediately
update the corresponding distribution. No test is applied here, as probes are
collected less frequently than processed requests.


Fig. 11. Real-time QoS control approach.
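
The probe-timer bookkeeping can be sketched as below. The sketch assumes that timers count requests, that a timer is reset when its alternative is invoked or probed, and that a probe becomes due once the timer strictly exceeds its time-out, as in the condition above. The class and method names are hypothetical.

```python
class ProbeTimers:
    def __init__(self, timeouts):
        # timeouts maps (task, alternative) to its probe time-out in requests.
        self.timeouts = dict(timeouts)
        self.timers = {key: 0 for key in self.timeouts}

    def request_done(self, used):
        """Advance the timers after one request.

        `used` is the set of (task, alternative) pairs invoked by the request.
        Returns the alternatives whose timer has expired and must be probed.
        """
        due = []
        for key in self.timers:
            if key in used:
                self.timers[key] = 0
                continue
            self.timers[key] += 1
            if self.timers[key] > self.timeouts[key]:
                due.append(key)
                self.timers[key] = 0  # the probe counts as a visit
        return due


# Third task with two alternatives; only the first one is ever selected.
timers = ProbeTimers({(3, 1): 10, (3, 2): 10})
for _ in range(10):
    timers.request_done({(3, 1)})
value_after_ten = timers.timers[(3, 2)]   # ten requests without a visit
due = timers.request_done({(3, 1)})       # the eleventh request expires it
```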

In order to evaluate the proposed QoS control methods we have performed 
extensive evaluation testing in an experimental setting. The results show that 
real-time service re-compositions indeed lead to dramatic savings in cost, while
still meeting QoS requirements of the end users. The reader is referred to [55] 
for the details. 


3.5 Level 1: Resource Management in Virtualized Infrastructure 


Level 1 deals with the dependencies of different physical resources, such as Cen- 
tral Processing Unit (CPU) time, Random Access Memory (RAM), disk I/O, 
and network access, and their effect on the performance that users perceive. 
These dependencies can be described by functions that map resource combi- 
nations, i.e. resource vectors, to scalars that describe the performance that is 
achieved with these resources. Therefore, such utility functions describe how the 
combination of different resources influences the performance users perceive [56]. 
Accordingly, utility functions (a) indicate in which ratios resources have to be 
allocated, in order to maximize user satisfaction and efficiency, (b) are deter- 
mined by technical factors, and (c) are investigated in this section. 


3.5.1 Methodology 

In order to get an idea about the nature of utility functions that VMs have during 
runtime, dependencies between physical resources, when utilized by VMs, and 
effects on VM performance are investigated as follows. Different workloads are 
executed on a VM with a changing number of Virtual CPUs (VCPU) and Virtual
RAM (VRAM) (this influences how many physical resources the VM can access) 
and varying load levels of the host system (this simulates contention among VMs 
and also influences how many physical resources the VM can access). 

A machine with a 2.5 Gigahertz (GHz) AMD Opteron 6180 SE processor 
with 24 cores and 6 and 10 MB of level 2 and 3 cache, respectively, and 64GB 
of ECC DDR3 RAM with 1333 MHz is used as host system. VM and host have
an x86-64 architecture and run Ubuntu 14.04.2 LTS, Trusty Tahr, which was the
latest Ubuntu release when the experiments were conducted.


3.5.1.1 Measurement Method. Resource consumption of VMs is measured by 
monitoring the VM’s (qemu [57]) process. In particular, the VM’s CPU time and 
permanent storage I/O utilization are measured with psutil (a Python system and
process utilities library) and the VM’s RAM utilization by the VM’s proportional 
set size, which is determined with the tool smem [58]. 


3.5.1.2 Workloads. Workloads are simulated by the following benchmarks of the
Phoronix test suite [59]. 


Apache. This workload measures how many requests the Apache server can 
sustain concurrently. 

Aio-stress. This benchmark assesses the speed of permanent storage I/O (hard 
disk or solid state drive). In a virtualized environment permanent storage 
can be cached in the host system’s RAM. Therefore, this test does not necessarily
result in access to the host system’s permanent storage.

7zip. This benchmark uses 7zip’s integrated benchmark feature to measure the 
system’s compression speed. 

PyBench. This benchmark measures the execution time of Python functions 
such as BuiltinFunctionCalls and NestedForLoops. Contrary to all other 
benchmarks, here a lower score is better. 


3.5.2 Results 
This section presents selected results from [60] that were achieved with the setup 
described above. 


3.5.2.1 RAM. Figure 12 shows the scores a VM achieves on the Apache and
PyBench benchmarks and the RAM it utilizes depending on the VRAM. For
each VRAM configuration 10 measurements are conducted. 

Figure 12a shows that when the VM executes Apache, it never utilizes more 
than 390 MB of RAM. In particular, for a VM with 100 to 350MB of VRAM 
the amount of RAM that is maximally utilized continuously increases but does 
not further increase, when more than 350MB of VRAM are added. Therefore, 
Fig. 12a shows that a VM with less than 350 MB of VRAM utilizes all RAM 
that is available, which seems to imply that this amount of RAM is critical for
performance. However, Fig. 12a also depicts that the Apache score only increases 
for up to 250 MB of VRAM and that this increase is marginal compared to the 
increase of RAM that is utilized. Therefore, the dependency between VRAM and
utilized RAM is much stronger than the dependency between VRAM/utilized
RAM and Apache score. In particular, while the RAM utilization more than
doubles, the Apache scores vary by less than 10%. This is particularly interesting,
because this configuration range includes 100 MB of VRAM which constrains the
VM’s RAM utilization to less than half of what the VM alone (without executing
any workload) would utilize.


Fig. 12. Benchmark scores and RAM utilization depending on a VM’s VRAM

Figure 12b shows that when the VM executes PyBench, the VM process 
utilizes 270 MB of RAM at most. Although the VM is constrained in its RAM
utilization, when it has less than 250MB of VRAM, there is no correlation 
between the achieved PyBench score and the VM’s VRAM, as the PyBench 
score does not increase. 

Therefore, Fig. 12 shows that RAM, which is actively utilized by a VM (be it
on startup or when executing an application), does not necessarily impact the VM’s
performance. In particular, even if the RAM utilized by a VM varies from 100 MB
to 350 MB, the VM’s Apache score, i.e., its ability to sustain concurrent server
requests, only changes by 10%. For PyBench the score was entirely independent
of the available RAM. This is particularly interesting, because not even a VM
with 100 MB of VRAM showed decreased performance, although this is the minimum
amount of RAM that avoids a kernel panic and even a VM that does not execute
any workload utilizes more, if possible.


3.5.2.2 VCPUs and Maximal RAM Utilization. The 7zip benchmark reveals an
interesting dependency of VCPUs and RAM utilization (cf. Fig. 13). As Fig. 13a 
shows, for one to three VCPUs a VM executing the 7zip benchmark utilizes 
1GB of RAM and for every two additional cores the RAM utilization increases 
by 400 MB (the VM had 9GB of VRAM). 

The distinct pattern in which RAM is utilized gives reason to believe that
it is essential for performance. Therefore, Fig. 13b compares the 7zip scores
achieved by VMs with 1 and 9 GB of VRAM. As Fig. 13a shows, the more VCPUs
a VM has, the more it will be constrained by only having 1 GB of VRAM, while
9 GB of VRAM does not even constrain a VM with 24 VCPUs. In line with this
observation, Fig. 13b shows that the difference between the 7zip scores achieved
by VMs with 1 and 9 GB of VRAM grows with the number of VCPUs. However,
the score difference is rather moderate compared to the large difference in terms
of RAM utilization. In particular, a VM with 24 VCPUs utilizes more than 5 GB
of RAM, if available. This is five times as much as a VM with 1 GB of VRAM
utilizes. However, the 7zip scores achieved by these VMs only differ by 15%.
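
The step pattern described for Fig. 13a can be written down as a small worked example. The text does not state at which VCPU count each 400 MB step occurs; the sketch assumes that a step is taken as soon as a further pair of VCPUs is started (rounding up), because this reading reproduces the "more than 5 GB" reported for 24 VCPUs. The function name and the use of 1 GB = 1000 MB are simplifications of this sketch.

```python
import math


def ram_7zip_mb(vcpus, base_mb=1000, step_mb=400):
    """Approximate RAM utilization of the 7zip benchmark for a VM with
    enough VRAM: 1 GB for up to three VCPUs, plus 400 MB per two more."""
    extra = max(0, vcpus - 3)
    return base_mb + step_mb * math.ceil(extra / 2)


ram_24 = ram_7zip_mb(24)             # 1000 + 400 * 11 = 5400 MB
ratio = ram_24 / ram_7zip_mb(1)      # roughly five times the 1 GB case
```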


Fig. 13. RAM utilization and performance, depending on the number of VCPUs and
amount of VRAM, of a VM executing the 7zip benchmark 


3.5.2.3 Multi Core Penalty. Figure 14a plots the Apache scores achieved by a VM
with 1 to 9 VCPUs, where 16 measurements per configuration were conducted.
The figure shows that the best performance is achieved when the VM has three
or four VCPUs, while additional VCPUs linearly decrease the Apache score. As
the figure depicts, up to three VCPUs significantly increase performance and four
VCPUs perform equally well. However, adding further VCPUs continuously
decreases performance. This effect, which is termed multi-core-penalty, occurred
independent of whether VCPUs were pinned to physical CPUs. Figure 14a also
demonstrates that, while three VCPUs perform best for an unstressed host, two
VCPUs perform best when the host is stressed. Furthermore, the multi-core-penalty
does not occur when the benchmark is executed natively, i.e., directly
on the host and not inside a VM. This shows that it is caused by the
virtualization layer. Despite the decrease of the Apache score with the number of
VCPUs, the VM’s utilization of CPU time increases with the number of VCPUs.
For example, for the Apache benchmark it was found that for 9 VCPUs the
utilized CPU time is roughly twice as high as the CPU time utilized by one to
three VCPUs (although the Apache score was significantly lower for 9 VCPUs).

Fig. 14. Two examples of the multi-core-penalty


Figure 14b shows that the multi-core penalty also occurs for the aio-stress 
benchmark, where a VM with one VCPU constantly achieves a higher aio-stress 
score than any VM with more VCPUs. In particular, the aio-stress score of a 
VM with only one VCPU is on average 30% higher than the aio-stress score of
VMs with more VCPUs. However, unlike the Apache benchmark, the aio-stress 
score does not decrease with the number of VCPUs. 


3.5.3 New Findings 

Most work on data center resource allocation assumes that resources such as 
CPU and RAM are required in static or at least well defined ratios and that 
the resulting performance is clearly defined. The results of this section do not 
confirm these idealistic assumptions. 
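
The static-ratio assumption is commonly formalized with a Leontief-type utility function, in which performance is limited by the scarcest resource relative to a fixed demand vector. The following minimal sketch illustrates that assumption only; it is not a model proposed in this section, and the resource names and numbers are hypothetical.

```python
def leontief_utility(allocation, demand):
    """Number of work units sustained when every unit needs `demand[r]`
    of each resource r: the scarcest resource determines the result."""
    return min(allocation[r] / demand[r] for r in demand)


demand = {"cpu": 1.0, "ram_gb": 1.0}
balanced = leontief_utility({"cpu": 4.0, "ram_gb": 4.0}, demand)
half_ram = leontief_utility({"cpu": 4.0, "ram_gb": 2.0}, demand)
```

Under this model, halving the RAM halves the predicted performance, whereas the measurements of Sect. 3.5.2 show score changes of at most 10 to 15% for much larger differences in RAM utilization.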

Section 3.5.2 did not find any significant effect of VRAM on VM performance.
Notably, even for workloads that seem to be RAM critical, as they utilize
RAM in distinct patterns, or workloads running on VMs with just enough VRAM
to avoid a kernel panic during boot, no significant effect was found. Even if a lack
of RAM impedes performance, the impediment is minor compared to the amount
of RAM that is missing (cf. Sect. 3.5.2). In contrast, a lack of RAM bandwidth
significantly affects performance [61] but is rarely considered when investigating
data center fairness. Section 3.5.2 showed that the amount of RAM that is
utilized by a VM may depend on the number of VCPUs. Section 3.5.2 presents
the most counter-intuitive finding, which is that, when multi-core benchmarks
are executed inside a VM, the performance often decreases when more VCPUs
are added to the VM.

This section showed that it is a complex task to determine a class of utility 
functions that properly models the allocation of a node’s physical resources (PRs)
to VMs. However, a realistic class of utility functions would greatly aid cloud
resource allocation, as it would allow allocations that are practically more
efficient to be determined theoretically. Therefore, positive results on this topic
would also greatly aid the performance of cloud federations, as they would allow
tasks to be executed in the cloud of a federation that performs best for the
respective task. Nonetheless, no work
exists on this topic. This lack of work is caused by the topic’s complexity. For 
example, resource dependencies vary over time, and depend on the workload 
that is executed inside a VM and the host’s architecture. Also, the performance 
of a VM is determined by a combination of resources as diverse as CPU time, 
RAM, disk I/O, network access, CPU cache capacity, and memory bandwidth, 
where substitutabilities may or may not apply. 


4 Cloud Federation for IoT 


4.1 State-of-the-Art in IoT Cloud Research 


The integration of IoT and clouds has been envisioned by Botta et al. [62] by 
summarizing their main properties, features, underlying technologies, and open 
issues. A solution for merging IoT and clouds is proposed by Nastic et al. [63]. 
They argued that system designers and operations managers faced numerous 
challenges to realize IoT cloud systems in practice, due to the complexity and 
diversity of their requirements in terms of IoT resources consumption, customiza- 
tion and runtime governance. They also proposed a novel approach for IoT cloud 
integration that encapsulated fine-grained IoT resources and capabilities in well- 
defined APIs in order to provide a unified view on accessing, configuring and 
operating IoT cloud systems, and demonstrated their framework for managing 
electric fleet vehicles. 

Atzori et al. [64,65] examined IoT systems in a survey. They identified many 
application scenarios, and classified them into five application domains: trans- 
portation and logistics, healthcare, smart environments (home, office, plant), 
personal, social and futuristic domains. They described these domains in detail, 
and defined open issues and challenges for all of them. Concerning privacy, they 
stated that much sensitive information about a person can be collected without 
their awareness, and its control is impossible with current techniques. 

Escribano [66] discussed the first opinion [67] of the Article 29 Data Pro- 
tection Working Party (WP29) on IoT. According to these reports four cate- 
gories can be differentiated: the first one is wearable computing, which means 
the application of everyday objects and clothes, such as watches and glasses, in 
which sensors were included to extend their functionalities. The second category 
is called the ‘quantified self things’, where things can also be carried by individ- 
uals to record information about themselves. With such things we can examine 
physical activities, track movements, and measure weight, pulse or other health 
indicators. The third one is home automation, which covers applications using 
devices placed in offices or homes such as connected light bulbs, thermostats,
or smoke alarms that can be controlled remotely over the Internet. They also 
mention smart cities as the fourth category, but they do not define them explicitly.
They argue that sharing and combining data through clouds will increase the number
of locations and jurisdictions where personal data resides. Therefore, it is crucial
to identify which stakeholder is responsible for data protection. WP29
named many challenges concerning privacy and data protection, like lack of user 
control, intrusive user profiling and communication and infrastructure related 
security risks. 

IoT application areas and scenarios have already been categorized, e.g.
by Want et al. [68], who set up three categories. The first is composable systems,
which are ad-hoc systems that can be built from a variety of nearby things by making
connections among these possibly different kinds of devices. Since these devices
can discover each other over local wireless connections, they can be combined
to provide higher-level capabilities. The second is smart cities, where modern
utilities could be managed more efficiently with IoT technologies. As an example,
traffic-light systems can be made capable of sensing the location and density of
cars in the area, and optimizing red and green lights to offer the best possible
service for drivers and pedestrians. The third is resource conservation scenarios,
where major improvements can be made in the monitoring and optimization of
resources such as electricity and water.


4.2 MobIoTSim for Simulating IoT Devices


Cloud Federation can help IoT systems by providing more flexibility and scala- 
bility. Higher level decisions can be made on where to place a gateway service to 
receive IoT device messages, e.g. in order to optimize resource usage costs and 
energy utilization. Such complex IoT cloud systems can hardly be investigated
in the real world; therefore, we need to turn to simulations.

The main purpose of MobIoTSim [69], our proposed mobile IoT device simulator,
is to help cloud application developers to learn IoT device handling without
buying real sensors, and to test and demonstrate IoT applications utilizing mul- 
tiple devices. The structure of the application lets users create IoT environment 
simulations in a fast and efficient way that allows for customization. 

MobIoTSim can simulate one or more IoT devices, and it is implemented
as a mobile application for the Android platform. The sensor data of
the simulated devices are either randomly generated values in the range given by
the user, or replayed data from trace files. The data sending frequency can also be
specified for every device. The application uses the MQTT protocol to send data
with the use of the open-source Eclipse Paho library. The data is represented in
a structured JSON object compatible with the IBM IoT Foundation message
format [70].
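
What a simulated device message may look like is sketched below in Python (the simulator itself is an Android application). The topic layout iot-2/evt/<event ID>/fmt/json and the top-level "d" wrapper follow our understanding of the IBM IoT Foundation conventions and should be checked against [70]. The function name, the event ID "status" and the parameter ranges are hypothetical, and publishing through an MQTT client such as Eclipse Paho is deliberately left out.

```python
import json
import random


def make_event(parameters, event_id="status", rng=random):
    """Build (topic, payload) for one simulated device reading.

    `parameters` maps a parameter name to its (low, high) range; each value
    is drawn uniformly from that range, as for randomly generated data.
    """
    readings = {name: round(rng.uniform(low, high), 2)
                for name, (low, high) in parameters.items()}
    topic = f"iot-2/evt/{event_id}/fmt/json"
    payload = json.dumps({"d": readings})
    return topic, payload


topic, payload = make_event(
    {"temperature": (15.0, 30.0), "humidity": (30.0, 60.0)},
    rng=random.Random(42))
```

The returned pair can then be handed to the publish call of whichever MQTT client library is used, at the sending frequency configured for the device.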

The basic usage of the simulator is to (i) connect to a cloud gateway, where 
the data is to be sent, (ii) create and configure the devices to be simulated 
and (iii) start the (data generation of the) required devices. These main steps 
are represented by three main parts of the application: the Cloud settings, the 
Devices and the Device settings screens. In the Cloud settings screen, the user 
can set the required information about the targeted cloud, where the data will be
received and processed. Currently there are two types of clouds supported: IBM
Bluemix and MS Azure. For the IBM cloud we have two options: the Bluemix
quickstart and the standard Bluemix IoT service. The Bluemix quickstart is a
public demo application that can visualise the data from a selected device. For
a fast and easy setup (i.e. to try out the simulator) this type is recommended.
The standard Bluemix IoT service type can be used if the user has a registered
account for the Bluemix platform, and has already created an IoT service. This IoT
service can be used to handle devices which have been registered before. The
main part of the IoT service is an MQTT broker; this is the destination of the
device messages, and it forwards them to the cloud applications. Such cloud
applications can process the data, react to it or just perform some visualisation.
The required configuration parameters for the standard Bluemix IoT service in
MobIoTSim are: the Organization ID, which is the identifier of the IoT service
of the user in Bluemix; an authentication key, so that the user does not
have to register the devices on the Bluemix web interface; and the command
and event IDs, which are customizable parts of the MQTT topics used to send
messages from the devices to the cloud and vice versa. MobIoTSim can register
the created devices with these parameters automatically, by using the REST
interface of Bluemix.

The Devices screen lists the created devices, where every row is a device or 
a device group. These devices can be started and stopped by the user at will, 
both together or separately for the selected ones. Some devices have the ability 
to display warnings and notifications sent back by a gateway. In this screen we 
can also create new devices or device groups. There are some pre-defined device 
templates, which can be selected for creation. These device templates help to 
create often used devices, such as a temperature sensor, humidity sensor or a 
thermostat. If the user selects a template for the base of the device, the message 
content and frequency will be set to some predefined values. The Thermostat
template has a temperature parameter; it turns on when reaching a pre-defined
low-level value and turns off at the high-level value. The On/Off state of the
device is displayed all the time. It is possible to select the Custom template to 
configure a device in detail. 
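
The switching behavior of the Thermostat template amounts to a simple hysteresis rule, sketched here in Python. The class name, the threshold values and the choice to keep the previous state between the two thresholds are assumptions of this sketch, not details of the Android implementation.

```python
class Thermostat:
    def __init__(self, low, high, on=False):
        self.low, self.high, self.on = low, high, on

    def update(self, temperature):
        """Switch on at or below `low`, off at or above `high`;
        in between, the previous On/Off state is kept."""
        if temperature <= self.low:
            self.on = True
        elif temperature >= self.high:
            self.on = False
        return self.on


thermostat = Thermostat(low=18.0, high=22.0)
states = [thermostat.update(t) for t in (20.0, 18.0, 20.0, 22.0, 20.0)]
```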

The new device creation and the editing of an existing one are made in the 
Device settings screen. The user can add more parameters to a device and can 
customize it with its own range. The range will be used to generate random 
values for the parameters. A device group is a group of devices with the same 
base template and they can be started and stopped together. If a device wants 
to send data to the Bluemix IoT service, it has to be registered beforehand. The 
registered devices have device IDs and tokens for authentication. The MobIoTSim
application handles the device registration in the cloud with REST calls,
so the user does not have to register the devices manually on the graphical web 
interface. There is an option to save the devices to a file and load them back to 
the application later. The device type attribute can be used to group devices. 
The simulation itself can also be saved, so the randomly generated data can be 
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replayed later many times. Even trace files from real world applications can be 
played from other sources, i.e. saved samples from the OpenWeatherMap public 
weather data provider [71]. The OpenWeatherMap monitors many cities and 
stores many parameters for them, including temperature, humidity, air pressure 
and wind speed. Using this trace loader feature, the simulation becomes closer 
to a real life scenario. In some cases, the user may want to send data to not just 
one but more cloud gateways at the same time. This is also possible by changing 
the organization ID attribute of a device to one of the already saved ones in the 
cloud settings. 
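The range-based value generation and the save/replay idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the uniform distribution and the JSON storage format are hypothetical and are not taken from MobIoTSim.

```python
import json
import random


def generate_samples(ranges, count, seed=None):
    """Draw `count` messages; each parameter is uniform within its (min, max) range."""
    rng = random.Random(seed)
    return [
        {name: round(rng.uniform(lo, hi), 2) for name, (lo, hi) in ranges.items()}
        for _ in range(count)
    ]


def save_simulation(samples):
    """Serialise generated messages so the same run can be replayed later."""
    return json.dumps(samples)


def replay_simulation(saved):
    """Return the saved messages in their original order."""
    return json.loads(saved)
```

Replaying the stored messages, rather than regenerating them, is what makes repeated runs comparable even though the data was random originally.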

We modified the Bluemix visualisation application to create a new private
gateway that handles more than one device at the same time. In this way we can
see the data from all devices in a real-time chart. The node.js application
subscribes to all device topics with the MQTT protocol, and waits for the data.
In this revised gateway we use paging to overcome device management limitations
(25 devices at a time). To visualise the data of many devices at the same time
more clearly, we introduced device grouping for the chart generation.
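Paging and grouping, as described above, can be sketched in a few lines of Python. The gateway itself is a node.js application, so this is only an illustration of the idea: the helper names are hypothetical, and only the page size of 25 comes from the text.

```python
from collections import defaultdict

PAGE_SIZE = 25  # device management limit quoted in the text


def pages(device_ids, page_size=PAGE_SIZE):
    """Yield consecutive slices of at most `page_size` device IDs."""
    for start in range(0, len(device_ids), page_size):
        yield device_ids[start:start + page_size]


def group_by_type(devices):
    """Group (device_id, device_type) pairs by type, e.g. one chart series per group."""
    groups = defaultdict(list)
    for device_id, device_type in devices:
        groups[device_type].append(device_id)
    return dict(groups)
```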

To summarize, MobIoTSim together with the proposed gateways provides
a novel solution to enable the simulation and experimentation of IoT cloud
systems. Our future work will address extensions for additional thing and sensor
templates, and will provide cases for scalability investigations involving multiple
cloud gateways.


5 Summary 


In this chapter we have reported the activities of the COST IC1304 ACROSS
European Project related to traffic management for Cloud Federation. In
particular, we have provided a survey of the CF architectures discussed and the
corresponding standardization activities, and we have proposed a comprehensive
multi-level model for traffic management for CF together with proposed
solutions for each level. The effectiveness of these solutions was verified by
simulation and analytical methods. The proposed levels are: Level 5 - Strategies
for building CF, Level 4 - Network for CF, Level 3 - Service specification and
provision, Level 2 - Service composition and orchestration, Level 1 - Task
service in cloud resources. Finally, we have presented a specialized simulator
for testing CF solutions in an IoT environment.


Abstract. In the paradigm of Internet of Things (IoT), sensors, actuators
and smart devices are connected to the Internet. Application providers
utilize the connectivity of these devices with novel approaches involving
cloud computing. Some applications require in-depth analysis of the
interaction between IoT devices and clouds. Research in this area is facing
questions such as how to govern such a large cohort of devices, which may
easily grow to tens of thousands. In this chapter we investigate IoT Cloud
use cases, and derive a general IoT use case. Distributed systems simulators
could help in such analysis, but they are problematic to apply in this newly
emerging domain, since most of them are either too detailed, or not extensible
enough to support the devices to be modelled. Therefore we also show how
generic IoT sensors could be modelled in a state-of-the-art simulator using
our generalized case to exemplify how the fundamental properties of IoT
entities can be represented in the simulator. Finally, we validate the
applicability of the introduced IoT extension with a fitness and a
meteorological use case.


Keywords: Internet of Things - Cloud computing - Simulation 


1 Introduction 


The Internet of Things (IoT) groups connected sensors (e.g. heart rate, heat,
motion, etc.) and actuators (e.g. motors, lighting devices) allowing for automated
and customisable systems to be utilised [8]. IoT systems are currently expanding
rapidly as the number of smart devices (sensors with networking capabilities) is
growing substantially, while the cost of sensors decreases.

IoT solutions are widely used within businesses to increase performance in
certain areas and to allow smarter decisions to be made based on more accurate
and valuable data. Businesses have grown to require IoT systems to be accurate,
as decisions based on their data are relied on heavily. An example of IoT in
industry is the tracking of parcels for delivery services. The system can
provide users with real-time information on where their parcel currently is
and notify them of potential arrival times. This requires a large
infrastructure, as a lot of data is produced.
© The Author(s) 2018

Sensors differ in their behaviour. For example, a heart rate sensor behaves
differently from a light sensor: a heart rate sensor relies on human behaviour,
which is inherently unpredictable, whereas the output of a light sensor can be
predicted quite accurately from the time of day and location. Predicting how
a sensor may impact a system is important, as companies generally want to get
the most out of an IoT system; however, an incorrect estimate of the
performance impact can damage the performance of other systems (e.g. using
too many sensors could flood the network, potentially causing inaccurate data,
slow responses, or system crashes). As there are many ways a sensor can behave,
it is difficult to predict the impact sensors may have on a scalable system;
they must therefore be tested to determine what the system can handle.
Performing this testing could be costly, time-consuming, and high-risk if the
infrastructure has to be created and a wide range of sensors purchased before
any information is obtained about the system. It is even more difficult to
determine the impact of a prototype system on the network, as there may be
limited or no physical sensors to perform tests with. An example of this is the
introduction of soil moisture sensors that analyse soil in real time and adjust
water sprinklers to ensure crops have the correct conditions to grow. In order
to test this IoT system effectively, many of these sensors are required;
however, they can become quite costly and difficult to deploy.

There are cloud simulators that provide the tools required to perform a
customised simulation of an IoT system, and these can simulate, with reasonable
accuracy, the performance impact that a particular setup may have on an
infrastructure. The issue with simulators is that, because of the wide range of
sensor behaviours, they cannot be too specific if they are to be useful to a
wide range of people; instead they rely on extensions to be implemented in
order to function. This requires a lot of specialised code (such as the
sensor’s behaviour and the network infrastructure) to be implemented on top of
the chosen simulator, which can take a lot of time and may have to be altered
frequently when situations change. This limits the simulator’s applicability,
as it demands programming skills, a lot of time, and a firm understanding of
the API.

In this research work we develop extensions for the DISSECT-CF [5] simula- 
tor, which already has the ability to model cloud systems, and has the potential 
to provide accurate representation of IoT systems. Therefore the goal of this 
research is to: (i) investigate IoT Cloud use cases, and (ii) derive a general IoT 
use case. We also show (iii) how generic IoT sensors could be modelled in a state 
of the art simulator using our generalized case to exemplify how the fundamen- 
tal properties of IoT entities can be represented in the simulator. Finally, we 
(iv) validate the applicability of the introduced IoT extension with a fitness and 
a meteorological use case. 

The remainder of this chapter is organised as follows: Sect. 2 presents related
work, and in Sect. 3 we detail our proposal for a general use case. In Sects. 4
and 5 we discuss two concrete applications, and the contributions are
summarised in Sect. 6.


Efficient Simulation of IoT Cloud Use Cases 315 


2 Related Work 


There are many simulators available to examine distributed and specifically cloud
systems. These existing simulators are mostly general network simulators, e.g.
Qualnet [1] and OMNeT++ [14]. With these tools IoT-related processes can
be examined, such as device placement planning and network interference. The
OMNeT++ discrete event simulation environment [14] is one of these examples,
and it can be used in numerous domains, from queuing network simulations to
wireless and ad-hoc network simulations, and from business process simulation to
peer-to-peer network, optical switch and storage area network simulations.

There are more specific IoT simulators, which are closer to our approach. As
an example, Han et al. [4] have designed DPWSim, which is a simulation toolkit
to support the development of service-oriented and event-driven IoT applications
with secure web service capabilities. Its aim is to support the OASIS standard
Devices Profile for Web Services (DPWS) that enables the use of web services on
smart and resource-constrained devices. SimIoT [13] is derived from the SimIC
simulation framework [12], which provides a deeper insight into the behavior
of IoT systems, and introduces several techniques that simulate the
communication between an IoT sensor and the cloud, but it is limited by its
compute-oriented activity modeling.

Moschakis and Karatza [9] have introduced several simulation concepts to
be used in IoT systems. They showed how the interfacing of the various cloud
providers and IoT systems could be modeled in a simulation. They also provided
a novel approach to apply IoT-related workloads, where data is gathered and
processed from sensors taking part in the IoT system. Unfortunately, their work
does not consider actuators, and rather focuses on the behavior of cloud systems
that support the processing of data originating from the IoT world. The dynamic
nature of IoT systems is addressed by Silva et al. [11]. They investigate fault
behaviors and introduce a fault model to these systems. Although faults are
important for IoT modeling, the scalability of the introduced fault behaviors
and concepts is not sufficient for investigating large-scale systems that would
benefit from decentralized control mechanisms.

Khan et al. [6] introduce a novel infrastructure coordination technique that
supports the use of larger-scale IoT systems. They build on CloudSim [3], which
can be used to model a community cloud based on residential infrastructures.
On top of CloudSim they provide customizations that are tailored to their
specific home automation scenarios, which limits the applicability of their
extensions for evaluating new IoT coordination approaches. These works are
also limited in the sensors and smart objects they cover, and thus do not allow
the evaluation of the wide range of IoT applications expected to come into
widespread use in the near future. Zeng et al. [15] proposed IOTSim, which
supports and enables the simulation of big data processing in IoT systems using
the MapReduce model. They also presented a real case study that validates the
effectiveness of their simulator.

In the field of resource abstraction for IoT, considerable efforts have been
made towards the description and implementation of languages and frameworks for
efficient representation, annotation and processing of sensed data. The
integration of IoT and clouds has been envisioned by Botta et al. [2] by
summarizing their main properties, features, underlying technologies, and open
issues. A solution for merging IoT and clouds is proposed by Nastic et al. [10].
They argue that system designers and operations managers face numerous
challenges to realize IoT cloud systems in practice, due to the complexity and
diversity of their requirements in terms of IoT resources consumption,
customization and runtime governance. We generally share these views in this
work, and build on these results by specifying our own contribution in the
field of IoT Cloud simulations.

Fig. 1. Model elements of IoT use cases


3 General IoT Extension for Cloud Simulators 


This section provides a small selection of use cases that display a wide
range of behaviours, communication models, and data flows. A wide scope of use
cases gives a much better understanding of the drawbacks of current simulation
solutions and offers insight into how to find common ground between them. The
list is only a partial selection of possible use cases: they were selected for
the potential differences between them, together building a fairly large pool
of behavioural patterns, after which introducing more use cases would have had
little impact on the overall experiment. The use case figures primarily display
data flows (with minor context actions when necessary), as these describe the
system accurately enough to understand its behaviour, and because simulators
generally work by modelling the data transactions between entities.

In Fig. 1 we introduce the basic elements of a generic IoT use case. We use 
these notations to represent certain properties and elements of these systems. 
Next we list and define these elements: 


– Entity/Entity Type. The entity box symbolises a physical device with some
  form of processing or communication capability. We have split the entities
  into 3 categories: Sensors, Gateway and Server.

– Process. The Process circle represents some form of data processing within
  the linked Entity. It is used to symbolise the transformation, testing, and/or
  checking of data flows to produce either further data flows or a contextual
  event. An example of this function is the interpretation of analog input data
  from a sensor into something usable.

– Action. The Action circle represents a contextual event, which generally
  comes in the form of a physical event. Actions usually require some form of
  data processing in order to trigger and thus are mostly used at the end of a
  data flow process. An example of this is a smartphone notification displaying
  a message from a cloud service.

– Data Store. The Data Store is used primarily by gateways and servers and
  symbolises the physical disk storage that a device might read from or write
  to. Although it is not necessary to model, it helps to show in some of the
  diagrams where the data may be coming from (as data stores are sometimes used
  as a buffer to hold the data).

– Data Transactions. Data Transactions display the movement of data between
  entities and processes via a range of methods. A Physical Data Transaction
  refers to a direct link that entities and processes may have, such as a wired
  connection. Bluetooth and Network transactions are differentiated to help
  understand how links are formed (to give a rough indication of the distances
  that can be assumed, Bluetooth having a shorter range than a network
  transaction).


Table 1. Use case feature requirements

Requirements (columns): Trace model, Trace replay, Custom device,
Responsive device

Use cases (rows):
1. Meteorological analysis
2. Automated waste management systems
3. Real time industrial water contamination system
4. Automated car parking space detector
5. Vehicle black box insurance system
6. Fitness watch activity tracker
7. Smartphone step counter


Fig. 2. The architecture of DISSECT-CF, showing the foundations for our extensions


In Table 1 we gathered the basic feature requirements of representative IoT
use cases. We have identified 4 requirements to be supported by simulations
focusing on IoT device behaviour:

– Trace model. Allow device behaviour to be characterised by its statistical
  properties (e.g., distribution functions and their properties such as the
  mean or median data packet size, communication frequency, etc.).

– Trace replay. Let devices behave according to real-life recordings from the
  past. Here we expect devices to be defined with pointers to trace files that
  contain network, storage and computing activities in a time series.

– Custom device. In general, we expect that most simulations can be described
  by fulfilling the above two requirements. However, if the built-in behaviour
  models are not sufficient, and there are no traces available, the simulation
  can incorporate specialised device implementations which implement the
  missing models.

– Responsive device. We expect that some custom devices will react to the
  surrounding simulated environment. Thus the device model depends not only on
  the internals of the device, but also on the device context (e.g., a gateway
  that can dynamically change its behaviour depending on the size of its
  monitored sensor set).
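To make two of these requirements more concrete, the following Python sketch shows a trace-model device and a responsive gateway. It is an illustration only and not part of DISSECT-CF: all class and parameter names are hypothetical, the Gaussian packet-size distribution is an assumed choice, and the gateway policy is invented for the example.

```python
import random


class TraceModelDevice:
    """Trace model: packet sizes drawn from a distribution given by its parameters."""

    def __init__(self, mean_size, stddev, period_s, seed=None):
        self.mean_size, self.stddev, self.period_s = mean_size, stddev, period_s
        self._rng = random.Random(seed)

    def events(self, duration_s):
        """Return (timestamp, size_in_bytes) pairs, one per communication period."""
        count = int(duration_s // self.period_s)
        return [
            (i * self.period_s, max(1, round(self._rng.gauss(self.mean_size, self.stddev))))
            for i in range(count)
        ]


class ResponsiveGateway:
    """Responsive device: its reporting period depends on the monitored sensor set."""

    def __init__(self, base_period_s):
        self.base_period_s = base_period_s
        self.sensors = []

    def attach(self, sensor):
        self.sensors.append(sensor)

    def reporting_period(self):
        # Assumed policy for illustration: more sensors -> report proportionally more often.
        return self.base_period_s / max(1, len(self.sensors))
```

The gateway's behaviour is a function of its context (the attached sensors) rather than of its own parameters alone, which is the distinction the responsive-device requirement draws.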


Based on these requirements, we examined seven cases ranging from smart region
down to smart home applications. We chose to examine these cases by means of
simulations, and in the following we focus on two of them: use cases 1 and 6.

DISSECT-CF [5] is a compact, highly customizable open source¹ cloud simulator
with special focus on the internal organization and behavior of IaaS systems.
Figure 2 presents its architecture. It groups the major components with
dashed lines into subsystems. There are five major subsystems implemented
independently, each responsible for a particular aspect of internal IaaS
functionality: (i) event system – for a primary time reference; (ii) unified
resource sharing – to resolve low-level resource bottleneck situations;
(iii) energy modeling – for the analysis of energy-usage patterns of individual
resources (e.g., network links, CPUs) or their aggregations; (iv) infrastructure
simulation – to model physical and virtual machines as well as networked
entities; and finally (v) infrastructure management – to provide a real life
cloud like API and encapsulate cloud level scheduling.

¹ Available from: https://github.com/kecskemeti/dissect-cf.

Fig. 3. 1. Use case: meteorological application

We aim at supporting the simulation of several thousand (or even more)
devices participating in previously unforeseen IoT scenarios, or in existing
systems that have not yet been examined in detail (e.g. in terms of
scalability, responsiveness, energy efficiency or management costs). Since
high performance of a simulator’s resource sharing mechanism is essential for
this, we have chosen the DISSECT-CF simulator because of its unified resource
sharing foundation. Building on this foundation, it is possible to implement
the basic constructs of IoT systems (e.g., smart objects, sensors or actuators)
while retaining the performance of the original simulator.

The proposed extension provides a runnable Application interface that takes
an XML file defining the Machine Data (such as Physical Machines, Repositories,
and their Connection data) and an XML file defining the Simulation Data
(such as the Devices and their behaviours). The Simulation Data can contain a
scalable number of Devices, and each device has its own independent behaviour
model. The behaviour of a Device can be modelled in a combination of 3 ways:
a direct link to a Trace File (which should contain the target device,
timestamp, and data size); a Trace Producer Model, which contains the
Distribution set to produce an approximation of the device trace; or device
extensions, which allow custom devices to be included in the source to
programmatically model more specific behaviours.
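As an illustration of the Trace File option, the sketch below reads trace rows holding the three fields named above. The comma-separated layout, the millisecond time unit and the names `TraceEvent` and `load_trace` are assumptions made for the example; the text does not specify the actual file format of the extension.

```python
import csv
import io
from typing import NamedTuple


class TraceEvent(NamedTuple):
    device: str      # target device
    timestamp: int   # assumed here to be milliseconds since simulation start
    size: int        # data size in bytes


def load_trace(text):
    """Parse 'device,timestamp,size' rows (an assumed layout) into time-ordered events."""
    rows = csv.reader(io.StringIO(text))
    events = [TraceEvent(d.strip(), int(t), int(s)) for d, t, s in rows]
    return sorted(events, key=lambda e: e.timestamp)
```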
Fig. 4. 6. Use case: fitness tracker application


3.1 1. Use Case: Meteorological Application 


In Fig. 3 we show the typical data flow of a weather forecasting service. This
application aims to make weather analysis more efficient by allowing the purchase
of a small weather station kit including light sensors (to potentially capture
cloud coverage), wind sensors (to collect wind speed), and temperature sensors
(to capture the current ambient temperature). The weather station then creates
a summary of the sensors’ findings over a certain period of time and reports it
to a Cloud service for further processing, such as detecting hurricanes or heat
waves in their early stages. If many of these stations are set up over a region,
they can provide an accurate and detailed data flow to the cloud service,
allowing it to produce accurate results.

In order to simulate this application, the simulator needs to provide
appropriate tools for performing the communications and processing. Defining
the behaviour of the sensors and the weather station requires a modelling
technique to be implemented on top of the simulator (which was achieved by
programming the sensors’ data production and the station’s buffer reporting).


3.2 6. Use Case: Fitness Tracking Application 


In Fig. 4 we show the data flow typically encountered when wearables or fitness
trackers such as Fitbit are used. This use case aims to track and encourage the
activity of a user by collecting a wide range of data about the user (such as
current heart rate, step count, floors climbed, etc.). This data is generally
collected by the wearable device and sent to the smartphone when the user opens
the smartphone application and requests the devices to synchronise. The data is
then also synced from the smartphone to the cloud for further processing (which
could result in trophies and milestones encouraging further use of the
wearable).

This use case provides an interesting range of behaviour, as it contains a
feedback mechanism that gives the user an incentive to perform specific actions
under certain circumstances. This is visible in the Trophy and Milestone
system, implemented on the server side, which tracks certain metrics (such as
the average time spent active daily) and sends notifications when the user is
approaching a goal (such as a daily milestone of 1 h of activity).

This mechanism introduces an important behaviour model in which the sensors
produce data that can trigger events, which in turn indirectly change the
behaviour of the sensors via a feedback loop. An example of this feedback loop
is the daily activity milestone: a user may perform 45 min of activity and
decide to take a rest, at which point the sensors revert to their baseline
behaviour (the user is inactive, therefore the sensors provide less data).
However, the system notifies the user that only an extra 15 min is necessary to
reach the milestone (the feedback), and thus the user may decide to hit the
target and perform more activity, which changes the behaviour of the sensors
yet again.
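The feedback loop of the milestone example can be expressed as a small Python sketch. It is purely illustrative: the function name, the thresholds as parameters and, above all, the user model (a notified user always adds the suggested extra activity) are assumptions, not something described by the use case itself.

```python
def simulate_day(activity_blocks, goal_min=60, nudge_within_min=15, nudge_extra_min=15):
    """Replay activity blocks (minutes) and apply the milestone feedback.

    Returns (total_active_minutes, notifications_sent). The user model is an
    assumption: a notified user always adds `nudge_extra_min` of activity.
    """
    total, notifications = 0, 0
    for minutes in activity_blocks:
        total += minutes
        remaining = goal_min - total
        # Server-side check: the goal is close but not yet reached.
        if 0 < remaining <= nudge_within_min:
            notifications += 1
            total += nudge_extra_min  # feedback changes what the sensors record
    return total, notifications
```

With the defaults, a single 45 min block triggers one notification and ends at the 60 min goal, mirroring the example in the text.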

It would be difficult to simulate this case with modelling strategies alone, as
the feedback mechanism combined with unpredictable and wide-ranging human
activity (users differ in the times at which they are active, and in the
intensity and duration of their exercise) involves too many variables. The time
of day is also a large factor in the behaviour of the sensor: it can be expected
to provide far less activity data during the night, when the user is likely
sleeping, than during the day. This is further compounded by time zone
differences: if the system is used in multiple time zones, it is harder to model
because the periods in which the user base is asleep differ.

For the above reasons, a wide range of traces would have to be collected to
obtain a large enough sample of different behaviour models to run an accurate
simulation of the system (which could be scaled up or down as required). This
introduces problems with current simulator solutions: not only is replay
functionality needed, but it must also be possible to replay several different
traces simultaneously in order to test a system with the multitude of
behavioural models that can be expected (there would be no point in simulating
a single behaviour model, considering that the real-world application is vastly
different).
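Replaying several traces simultaneously amounts to merging time-ordered event streams. A minimal Python sketch of that idea follows; the tuple layout and the function name are hypothetical, and the sketch says nothing about how the simulator extension implements replay internally.

```python
import heapq


def replay_together(*traces):
    """Merge per-device traces, each sorted by timestamp, into one ordered stream.

    Every trace is a list of (timestamp, device, size_in_bytes) tuples.
    """
    return list(heapq.merge(*traces, key=lambda event: event[0]))
```

Because each input trace is already sorted, the merge does not need to re-sort the combined data, which matters when many long traces are replayed at once.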


4 Implementing the Extension for a Meteorological Application


Based on the generic plans discussed before, we extended the DISSECT-CF
simulator towards a meteorological application covering a wider region. To
derive the sensor models for the extension, we started by modelling a
real-world IoT system: since some of the earliest examples of sensor networks
come from the field of meteorology and weather prediction, we chose to model
the crowdsourced meteorological service of Hungary called Idokep.hu. It was
established in 2004, and it is one of the most popular websites on meteorology
in Hungary. Since 2008, weather information has also been available for Croatia
and Germany. Detailed information on its system architecture and operation can
also be found on the website: more than 400 stations send sensor data to the
system (including temperature, humidity, barometric pressure, rainfall and wind
properties), and the actual weather conditions are refreshed every 10 min. The
service also provides forecasts up to a week ahead, and produces and sells
sensor stations that extend its sensor network and improve its weather
predictions. These can be bought and installed at buyer-specific locations.

We followed a bottom-up approach to add IoT functionalities to the simulator, 
and implemented a weather prediction application using public data available on 
sensors and their behaviour at http://www.idokep.hu. 

Each entity that aims to perform repeated events in DISSECT-CF has to
use the Timed class (see Fig. 2) by implementing the tick() method. We added
two such classes, the Application and the Station. The Station is an entity
acting as a gateway, i.e. it provides the network connection for sensors, and
optimises their network usage by caching and bundling the outgoing metering
data of its supervised sensors. Figure 5 depicts how data about each station in
an IoT system is stored. This description is useful for setting up predefined
stations from files. The tasksize attribute of Application defines the amount
of data (in bytes) to be gathered in a cloud storage (sent by the stations)
before it is processed in a VM.
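The idea of entities that receive periodic tick() calls can be illustrated with a small Python analogue. DISSECT-CF itself is a Java simulator and its Timed class has its own API; the classes and the `run` loop below are hypothetical and only mimic the pattern of subscribing with a period and being called back at simulated times.

```python
import heapq
import itertools


class Timed:
    """Python analogue of a periodically ticking entity (not DISSECT-CF code)."""

    def __init__(self, period):
        self.period = period

    def tick(self, now):
        raise NotImplementedError


class CountingStation(Timed):
    """Toy subclass: records the simulated times at which it was ticked."""

    def __init__(self, period):
        super().__init__(period)
        self.ticks = []

    def tick(self, now):
        self.ticks.append(now)


def run(entities, until):
    """Fire each entity's tick() every `period` time units up to `until`."""
    order = itertools.count()  # tie-breaker so entities are never compared
    queue = [(e.period, next(order), e) for e in entities]
    heapq.heapify(queue)
    while queue and queue[0][0] <= until:
        now, _, entity = heapq.heappop(queue)
        entity.tick(now)
        heapq.heappush(queue, (now + entity.period, next(order), entity))
```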

Stations have unique identifiers (i.e., a name). Their lifetime is specified
with the time tag by defining starttime and stoptime. The cardinality of
the supervised sensor set is set via snumber. Alongside the set cardinality, one
can also specify the average data size produced by one of the sensors in the set.
To set up several stations with the same properties, one can use the count
option of the name tag. The data generation frequency (freq) can be set for the
sensor set (in milliseconds). The station’s caching mechanism is influenced by
the ratio tag. This defines the amount of data to be kept in the local storage
relative to the average dataset produced by the sensors at each data generation
event. If the unsent data in the local storage (which is defined in storage)
exceeds the caching limit, the station is modelled to send the cached items to
the cloud’s storage (identified by its network node id specified in the torepo
tag). The local storage also keeps a log of previously sent data until its
capacity (defined in the storage tag) is exceeded. The station’s network
connectivity to the outside world is specified by the tags maxinbw and maxoutbw.

    <Application tasksize='250000'>
      <Station>
        <name count='1'>Szeged</name>
        <freq>60000</freq>
        <snumber size='200'>10</snumber>
        <time starttime='500' stoptime='1000'>1000</time>
        <maxinbw>100</maxinbw>
        <maxoutbw>100</maxoutbw>
        <storagebw>100</storagebw>
        <torepo>sztakilpdsceph</torepo>
        <storage>60000</storage>
        <ratio>1</ratio>
      </Station>
    </Application>

Fig. 5. XML-based description of IoT systems
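A description like the one in Fig. 5 can be read with a few lines of Python. This sketch only illustrates the structure of the file: the extension itself stores station entries in the StationData Java bean, whereas `StationConfig`, its field names and `load_stations` are hypothetical. Reading the size attribute of snumber as the average per-sensor data size is also an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

FIG5 = """<Application tasksize='250000'>
  <Station>
    <name count='1'>Szeged</name>
    <freq>60000</freq>
    <snumber size='200'>10</snumber>
    <time starttime='500' stoptime='1000'>1000</time>
    <maxinbw>100</maxinbw>
    <maxoutbw>100</maxoutbw>
    <storagebw>100</storagebw>
    <torepo>sztakilpdsceph</torepo>
    <storage>60000</storage>
    <ratio>1</ratio>
  </Station>
</Application>"""


@dataclass
class StationConfig:
    name: str
    count: int
    freq_ms: int
    sensors: int
    sensor_size: int   # assumed meaning of the size attribute of snumber
    start: int
    stop: int
    torepo: str
    storage: int
    ratio: float


def load_stations(xml_text):
    """Return (tasksize, [StationConfig, ...]) parsed from a Fig. 5 style description."""
    root = ET.fromstring(xml_text)
    stations = []
    for node in root.findall("Station"):
        name, snumber, time = node.find("name"), node.find("snumber"), node.find("time")
        stations.append(StationConfig(
            name=name.text.strip(),
            count=int(name.get("count", "1")),
            freq_ms=int(node.findtext("freq")),
            sensors=int(snumber.text),
            sensor_size=int(snumber.get("size")),
            start=int(time.get("starttime")),
            stop=int(time.get("stoptime")),
            torepo=node.findtext("torepo").strip(),
            storage=int(node.findtext("storage")),
            ratio=float(node.findtext("ratio")),
        ))
    return int(root.get("tasksize")), stations
```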


Individual Station entries in the XML are saved in the StationData Java bean.
The actual data generation of the sensors is performed by the Metering class.

The Cloud class can be used to specify and set up a cloud environment. This
class uses DISSECT-CF’s XML-based cloud loader to set up a cloud environment
to be used for storing and processing data from stations. This class should also
be used to define Virtual Appliances modeling the application binaries doing the
in-cloud processing.

The scenarios to be examined through simulations should be defined by the 
Application class. Users are expected to implement custom IoT Cloud use cases 
here by examining various management and processing algorithms of sensor data 
in VMs of a specific cloud environment. The VmCollector class can be used to 
manage such VMs, and its VmSearch() method can be used to check if there is 
a free VM available in the cloud to be utilized for a certain task. If this is not 
the case, the generateAndAdd() method can be used to deploy a new one. 
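The "reuse a free VM, otherwise deploy a new one" pattern can be sketched in Python as follows. The sketch is a hypothetical stand-in, not the VmCollector class: the names `Vm`, `VmPool`, `search_free`, `generate_and_add` and `acquire` are invented here and only loosely mirror the VmSearch() and generateAndAdd() methods mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Vm:
    vm_id: int
    busy: bool = False


@dataclass
class VmPool:
    """Hypothetical stand-in for the VM bookkeeping described in the text."""
    vms: list = field(default_factory=list)

    def search_free(self):
        """Return an idle VM, or None when every VM is busy."""
        return next((vm for vm in self.vms if not vm.busy), None)

    def generate_and_add(self):
        """Deploy (here: simply create) a new VM and register it."""
        vm = Vm(vm_id=len(self.vms))
        self.vms.append(vm)
        return vm

    def acquire(self):
        """Reuse a free VM when possible, otherwise deploy a new one."""
        vm = self.search_free() or self.generate_and_add()
        vm.busy = True
        return vm
```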


4.1 Implementation with the Generic IoT Oriented Extensions 


The weather station’s caching behaviour is a prime example of the need for
responsive device implementations. As the sensors produce data independently of
each other, and they may have varying frequencies and data sizes, the station
must cache all produced data before sending it to the cloud for processing. This
behaviour was modelled as a custom, responsive device for which we overrode the
tick() function of our new device sub-class. In DISSECT-CF terminology, this
function is the one used to represent periodic events in the simulation;
in this particular case it was used to simulate the data reporting requests from
the cloud. Each station has connections to its 8 sensors, which produced randomly
sized data with the frequency of las — 1] Hz. Upon every tick call, our custom
device determines whether it needs to send its buffered contents to the cloud.
This is based on the buffered data size, which was set to be at least 1 kB
before emptying the buffer.

Fig. 6. Analysis of the buffering behaviour in the alternative simulations of a weather
station
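The buffering rule can be illustrated with a toy Python model. It is a simplification under stated assumptions, not the extension's code: the class name, the uniform reading sizes with an assumed upper bound, and the merging of data production and the flush check into a single tick are all choices made for the example. Only the 8 sensors and the 1 kB threshold come from the text.

```python
import random


class CachingStation:
    """Toy responsive station: buffers sensor data and flushes at a size threshold."""

    def __init__(self, sensors=8, threshold=1000, max_sample=100, seed=None):
        self.sensors = sensors          # 8 sensors per station, as in the text
        self.threshold = threshold      # 1 kB, taken here as 1000 bytes
        self.max_sample = max_sample    # assumed upper bound of one reading, in bytes
        self.buffered = 0
        self.sent = []                  # sizes of the buffer loads sent to the cloud
        self._rng = random.Random(seed)

    def tick(self):
        """One periodic event: collect a reading per sensor, then flush if large enough."""
        for _ in range(self.sensors):
            self.buffered += self._rng.randint(1, self.max_sample)
        if self.buffered >= self.threshold:
            self.sent.append(self.buffered)
            self.buffered = 0
```

Since a flush happens only once the threshold has been crossed, every buffer load is at least the threshold and exceeds it by less than one tick's worth of data.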

The implementation was tested by running the original and the new
implementations side by side so that we could analyse the differences in
network traffic. Due to the random nature of the data production the two
solutions do not line up completely; however, Fig. 6 shows that the simulation
extension produces a result very similar to the original implementation:
although there is a lot of randomness in the investigated scenario, the mean and
median values match closely. The distribution also follows the same pattern:
the bulk of the buffer loads are around 1600 bytes, and loads become less
frequent the further they are from this value.

As can be observed, the basic extensions described here mainly focus
on device behaviour. The application level operations are completely up to the
user to define. For example, the application logic that determines how many
virtual machines are needed for processing the sensor data is not described by
the XML descriptors. In the next sub-section we discuss such situations and
explore how to combine application level behaviour with the new sensor and
device models.

4.2 Evaluation with Alternative Application Level Scenarios


During our implementation and evaluation, we used publicly available
information to populate our experiments wherever applicable. Unfortunately, some
details are unpublished (e.g., sensor data sizes and data-processing times); for
those, we provide estimates and list them below.

From the website of Idokep.hu², we learnt that the service operates 487
stations. Each station has sensors that monitor at most the following
environmental properties:


1. timestamp;
2. air and dew point temperature, in °C;
3. humidity, in %;
4. barometric pressure, in hPa;
5. rainfall, in mm/hour and mm/day;
6. wind speed, in km/h;
7. wind direction;
8. and UV-B level.


Concerning the size of such sensor data, we expect them to be saved in a
structured text file (e.g., CSV). Stored this way, we estimate that
approximately 50 bytes are produced if each sensor produces data in every
measurement (based, e.g., on the website of the Murdoch University Weather
Station³).

Next, we detail the behaviour of our Application implementation, which was
used for all later evaluation scenarios (see Fig. 7):


1. Set up the cloud using an XML. As we expect meteorological scenarios will 
often use private clouds, we used the model of our local private infrastructure 
(the LPDS Cloud of MTA SZTAKI); 

2. Set up the 487 stations (using a scenario specific XML description) with the 
previously listed 8 sensors per station; 

3. Start the Application to deploy an initial VM (generateAndAddVM()) for 
processing and to start the metering process in all stations (startStation()); 

4. The stations then monitor (Metering()), save and send (startCommunicate())
sensor data (to the cloud storage) according to their XML definition;

5. A daemon service checks regularly whether the cloud repository has received a
scenario specific amount of data (see the tasksize attribute in Fig. 5). If so,
the Application generates tasks that will finish processing within a
predefined amount of time.

6. Next, for each generated task, a free VM is searched for (by VmSearch()). If a
VM is found, the task and the relevant data are sent to it for processing.

7. In case no free VMs are found, the daemon initiates a new VM deployment
and holds back the not yet mapped tasks.

8. If, at the end of the task assignment phase, there are still free VMs, they are
all decommissioned (by turnoffVM()) except the last one (allowing the next
rounds to start with an already available VM). Note that this behaviour can be
turned on or off at will.

9. Finally, the Application returns to step 5.

2 http://idokep.hu/automata.
3 http://wwwmet.murdoch.edu.au/downloads.

Fig. 7. Sequence diagram of the weather station modelling use case and its relations
to our DISSECT-CF extensions
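
Steps 5 to 8 of the Application behaviour listed above can be sketched as follows. This is a simplified illustration under our own assumptions and not the actual implementation: the classes SimpleVm and ApplicationSketch and all of their members are hypothetical, tasks carry no data or runtime, and we assume that at most one new VM is deployed per daemon round.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, much simplified VM: it is either busy with one task or free. */
class SimpleVm {
    boolean busy = false;
}

/** Hypothetical sketch of steps 5 to 8; none of these names are DISSECT-CF classes. */
class ApplicationSketch {
    final List<SimpleVm> vms = new ArrayList<>();
    final long taskSizeBytes;          // plays the role of the tasksize attribute
    final boolean decommissionIdleVms; // step 8 can be switched on or off
    long storedBytes = 0;              // data waiting in the cloud repository
    int pendingTasks = 0;              // generated but not yet mapped tasks

    ApplicationSketch(long taskSizeBytes, boolean decommissionIdleVms) {
        if (taskSizeBytes <= 0) {
            throw new IllegalArgumentException("task size must be positive");
        }
        this.taskSizeBytes = taskSizeBytes;
        this.decommissionIdleVms = decommissionIdleVms;
        vms.add(new SimpleVm()); // step 3: one initial VM
    }

    /** One daemon check. Returns the number of tasks mapped to VMs in this round. */
    int runRound() {
        // Step 5: generate one task per full tasksize chunk of received data.
        while (storedBytes >= taskSizeBytes) {
            storedBytes -= taskSizeBytes;
            pendingTasks++;
        }
        // Step 6: map pending tasks to free VMs.
        int mapped = 0;
        for (SimpleVm vm : vms) {
            if (pendingTasks == 0) {
                break;
            }
            if (!vm.busy) {
                vm.busy = true;
                pendingTasks--;
                mapped++;
            }
        }
        if (pendingTasks > 0) {
            // Step 7: no free VM is left, so deploy a new one (assumption: one
            // per round) and hold back the remaining tasks until the next round.
            vms.add(new SimpleVm());
        } else if (decommissionIdleVms) {
            // Step 8: turn off all free VMs except one.
            boolean keptOne = false;
            Iterator<SimpleVm> it = vms.iterator();
            while (it.hasNext()) {
                if (!it.next().busy) {
                    if (keptOne) {
                        it.remove();
                    } else {
                        keptOne = true;
                    }
                }
            }
        }
        return mapped;
    }

    /** Marks every running task as finished, which frees all VMs. */
    void completeAllTasks() {
        for (SimpleVm vm : vms) {
            vm.busy = false;
        }
    }
}

class ApplicationSketchDemo {
    public static void main(String[] args) {
        ApplicationSketch app = new ApplicationSketch(250_000, true);
        app.storedBytes = 500_000; // enough data for two tasks
        System.out.println(app.runRound() + " task(s) mapped, " + app.vms.size() + " VM(s)");
    }
}
```

Calling runRound() repeatedly corresponds to step 9, in which the Application returns to step 5.
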

4.3 Evaluation


In this sub-section, we present five scenarios that address questions likely to
be investigated with the help of the extended DISSECT-CF. Namely, our scenarios
mainly focus on how resource utilization and management patterns change with
sensor behaviour (e.g., how different sensor data sizes and varying
numbers of stations and sensors affect the operation of the simulated IoT system).
Note that these scenarios are solely intended to validate our
proposed IoT extensions; they are therefore mostly underdeveloped in
terms of how a weather service would behave internally.

Before getting into the details, we clarify the common behaviour patterns that we
used in all of the scenarios below. First of all, to limit simulation runtime,
all of our experiments limited the station lifetimes to a single day. The start-up
period of each station was selected randomly between 0 and 20 min. The task
creator daemon service of our Application implementation spawned tasks after
the cloud storage received more than 250 kB of metering data (see the tasksize
of Fig. 5). This step ensured the estimated processing time of 5 min/task. VMs
were started for each 250 kB data set. The daemon emptied the cloud storage
completely: the last spawned task was started with less than 250 kB
to process, which scaled down its execution time. Finally, we disabled the dynamic
VM decommissioning feature of the application (see step 8 in Sect. 4.2).
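
The way the daemon drains the cloud storage can be illustrated with a short sketch. The names ProcessingTask and TaskSplitter are hypothetical, 1 kB is taken as 1000 bytes, and the execution time of the last, smaller task is assumed to scale linearly with its size; the text only states that it is scaled down.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical task description: amount of data and estimated processing time. */
class ProcessingTask {
    final long bytes;
    final double minutes;

    ProcessingTask(long bytes, double minutes) {
        this.bytes = bytes;
        this.minutes = minutes;
    }
}

/** Hypothetical helper that splits the stored data into tasks. */
class TaskSplitter {
    static final long TASK_SIZE_BYTES = 250_000;      // tasksize (assumed: 1 kB = 1000 bytes)
    static final double MINUTES_PER_FULL_TASK = 5.0;  // estimated time for a full task

    /** Splits the stored data into full tasks plus one smaller last task if needed. */
    static List<ProcessingTask> split(long storedBytes) {
        if (storedBytes < 0) {
            throw new IllegalArgumentException("negative amount of data");
        }
        List<ProcessingTask> tasks = new ArrayList<>();
        long remaining = storedBytes;
        while (remaining > 0) {
            long chunk = Math.min(remaining, TASK_SIZE_BYTES);
            // Assumption: the execution time is proportional to the task size.
            double minutes = MINUTES_PER_FULL_TASK * chunk / TASK_SIZE_BYTES;
            tasks.add(new ProcessingTask(chunk, minutes));
            remaining -= chunk;
        }
        return tasks;
    }
}

class TaskSplitterDemo {
    public static void main(String[] args) {
        for (ProcessingTask task : TaskSplitter.split(600_000)) {
            System.out.println(task.bytes + " bytes, " + task.minutes + " min");
        }
    }
}
```

With 600 kB of stored data, this yields two full tasks and a final task of 100 kB whose estimated execution time is reduced accordingly.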

In scenario N°1, we varied the amount of data produced by the sensors: we set
50, 100 and 200 bytes for the different cases (allowing for overheads of storage,
network transfer, different data formats, secure encoding, etc.). We simulated the 487
stations of the weather service. Our results can be seen in Fig. 8a and b. In the
first case, with 50 bytes of sensor data, we measured 256 MB of produced data
in total; in the second case, with 100 bytes, we measured 513 MB; and in the
third, with 200 bytes, we measured 1.02 GB (showing linear scaling). In the three
cases we needed 6, 10 and 20 VMs, respectively, to process all tasks.

In scenario N°2, we wanted to examine the effects of varying sensor numbers
and varying sensor data sizes per station to better mimic real world systems.
Therefore, we defined a fixed case using 744 stations having 7 sensors each,
producing 100 bytes of sensor data per measurement, and a random case, in
which the 744 stations had randomly sized sensor sets (between 6 and 8
sensors) and sensor data sizes (50, 100 or 200 bytes/sensor). The results can be seen
in Fig. 9a and b. We experienced minimal differences; the random
case resulted in slightly more tasks.

Fig. 9. Scenario N°2

In scenario N°3, we examined random sensor data generation frequencies. We
set up 600 stations, and defined cases for two static frequencies (one
measurement every 1 min and every 5 min), and a third case, in which we
randomly set the sensing interval between 1 and 5 min. In real life, varying
weather conditions may call for (or result in) such changes. In all three
cases, the sensors generated our previously estimated 50 bytes. The results can
be seen in Fig. 10a, b and c. The total amount of generated data was 316 MB for
the 1 min frequency, 63 MB for the 5 min frequency, and 143 MB for the randomly
selected frequencies. The first
case required the highest number of VMs to process the sensed data, but the
randomly modified sensing frequency resulted in the highest number of tasks.

In the three scenarios executed so far, the main application, responsible for
processing the sensor data in the cloud, checked the repository for new transfers
every minute. In some cases we found that only a small amount of data had
arrived within this interval (i.e., the task creation frequency). Therefore, in scenario
N°4, we examined what happens if we widen this interval to 5 min. We executed
three cases here with 200, 487 and 600 stations. The results can be seen in
Fig. 11a. Fig. 11b shows the number of VMs required for processing the
tasks in each case. The first case has the largest difference in terms of task
numbers: data coming from the sensors of 200 stations needed more than 1400 tasks
with the 1 min interval, but fewer than 600 with the 5 min interval. It is also
interesting that with 600 stations almost the same number of tasks was generated, but
with the 5 min interval we needed more VMs to process them.

Fig. 10. Scenario N°3

As we model a crowdsourced service, we expect to see more dynamic
behaviour regarding stations. In the previous cases we used a static number of
stations per experiment, while in our final scenario, N°5, we ensured that station
numbers change dynamically. Such changes may occur due to station or sensor
failures, or even sensor replacement. In this scenario we performed these
changes at specific hours of the day: from 0 to 5 am we started 200 stations, from
6 to 8 am we operated 500 stations, from 9 am to 3 pm we scaled them down to
300, then from 4 to 6 pm up to 500 again, and in the last round, from 7 pm to
midnight, we set the number back to 200. In this experiment we also wanted to
examine the effects of VM decommissioning; therefore, we executed two different
cases, one with and one without turning off unused VMs. In both cases we set the
tasksize attribute to 10 kB (instead of the usual 250 kB). The results can be
seen in Fig. 12. Without turning off the unused VMs, we kept more than
20 VMs alive from 6 pm onwards (resulting in more overprovisioning), while in the
other case the number of running VMs changed dynamically to match the number
of tasks to be processed.
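
The station schedule of scenario N°5 can be written down as a simple lookup. The class StationSchedule is a hypothetical illustration; we assume that each range in the text covers whole hours inclusively (hours 0 to 5, 6 to 8, 9 to 15, 16 to 18 and 19 to 23).

```java
/** Hypothetical lookup of the number of active stations in scenario N°5. */
class StationSchedule {
    /** Returns the number of active stations for an hour of the day (0 to 23). */
    static int activeStations(int hour) {
        if (hour < 0 || hour > 23) {
            throw new IllegalArgumentException("hour must be between 0 and 23");
        }
        if (hour <= 5) {
            return 200;  // hours 0 to 5
        }
        if (hour <= 8) {
            return 500;  // hours 6 to 8
        }
        if (hour <= 15) {
            return 300;  // hours 9 to 15
        }
        if (hour <= 18) {
            return 500;  // hours 16 to 18
        }
        return 200;      // hours 19 to 23
    }

    /** Sums the active stations over all 24 hours of the simulated day. */
    static int stationHoursPerDay() {
        int total = 0;
        for (int hour = 0; hour < 24; hour++) {
            total += activeStations(hour);
        }
        return total;
    }
}

class StationScheduleDemo {
    public static void main(String[] args) {
        System.out.println("station-hours per day: " + StationSchedule.stationHoursPerDay());
    }
}
```

Under these assumptions the schedule amounts to 7300 station-hours per simulated day, compared with 11688 for the static 487 stations of scenario N°1.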

In summary, in this section we presented five scenarios focusing on various
properties of IoT systems. We have shown that with our extended simulator, we 
can investigate the behaviour of these systems and contribute to the development 
of better design and management solutions in this research field. 

Fig. 12. Results of scenario N°5


5 Implementing the Extension for a Fitness Application 


This use case was selected for implementation to allow us to replay real world
data logs for multiple devices so that we could test the simulator's trace replaying
capabilities. It is important that the application can run through the trace logs
of each device individually and correctly perform the network transfers that are
detailed in them. The trace logs to be played were acquired with a special traffic
interception application developed for the smartphone. Our application collected
access and network traffic logs for the watch, the smartphone and the cloud. After
data collection, the logs were saved in a file format ready to be used as an input
trace to the simulator. This extension was developed within a BSc thesis
work [7] at the Liverpool John Moores University, UK.

5.1 Trace Collection


Initially, we aimed to collect all of the network traffic between the three devices
with packet analysing software (such as Wireshark) on a laptop that acted as a
wireless hotspot for the smartphone. However, this would have severely limited the
accuracy of the traces, as it requires disabling the network access of the fitness
application whenever the phone is not connected to the laptop (to ensure that all
its communication with the cloud is caught). On top of this, we would have lost the
ability to trace the Bluetooth traffic between the watch and the smartphone.

As a result, we turned our attention to methods that intercept network
traffic directly on the phone. Despite the multitude of third party Android
network traffic analysers, we could not find one that met our requirements:
(i) it should run in the background (allowing us to use the fitness application
at will); (ii) it should output logs of network and Bluetooth activity that are either
directly processable by the simulator or in a format that can be easily
transformed to the needed form; and (iii) it should remain active for long periods of
time (as the log collection ran for days).

Therefore, we decided to create an application that meets all of these
requirements and allows us to localise the data collection in one place.
The Fitbit connection monitor application⁴ is built on top of an Android
subsystem called the Xposed Framework. Using this framework, we were able to
intercept socket streams for network I/O, while for Bluetooth, we
intercepted traffic through Android's GATT service.

A sample of the intercepted data traces is shown in Fig. 13. This figure shows
the data that was collected by the Fitbit Connection Monitor over the course
of around 2 weeks (over 20,000 trace entries of real life data). There are several
interesting situations one can observe in the raw data. First, it shows peaks of
network activity when (i) there was a manually invoked data
synchronisation or (ii) the user issued a firmware update request for the watch. In
contrast, there were gaps in the data collection as well. These gaps represent
situations such as: (i) the user did not wear the watch, (ii) Bluetooth was
disabled on the smartphone or (iii) the watch was not switched on (e.g., because
it ran out of battery power).


5.2 Implementation and IoT Extensions to DISSECT-CF 


In our initial implementation, we followed an approach similar to the one we used
for the meteorological case: we implemented the fitness use case with
the original DISSECT-CF APIs. Then we also implemented a solution that was
built on top of our new IoT oriented extensions of the DISSECT-CF APIs⁵. To
better understand this solution, we first summarize the extensions.


4 The application is open source and available at
https://github.com/Andrerm124/FitbitConnectionMonitor.

5 The source code of the second implementation is available online at
https://github.com/Andrerm124/dissect-cf/tree/FitbitSimulation.

Fig. 13. Real-life network traffic in the fitness use case according to the long term trace
collection results


Figure 14 presents the new extensions to DISSECT-CF. With the extension,
one can define a simulation with two XML files. First, the original simulator
API loads all of the physical machines from the supplied Machine XML file (the
loaded machines represent the computational, network and storage
capabilities of the IoT devices). In the second XML file, device models can be linked to
each of the previously loaded machines. Each model can be customised
independently by altering the desired attributes of the built-in device templates.
In these templates, one can define the following details: (i) the machine id to bind
to, (ii) the time interval for the presence of the device, (iii) custom attributes and
behaviour (this part must still be coded in Java), (iv) network behaviour (in
the form of a trace or a distribution function), (v) typical network endpoints and
(vi) data storage and caching options (both device local and remote, e.g., in the
cloud). The loading of these XML files and the management of the device objects
are accomplished by the Application class. Finally, the extension also provides
alternative packet routing models in the form of several implementations
of the ConnectionEvent interface.

Fig. 14. The IoT oriented DISSECT-CF extensions


To analyse the effectiveness of our extensions, we compared the
development time and the simulation results for the fitness application. The initial
implementation was created as custom classes for all devices participating
in the use case. This required approximately 3 days of development time. In
contrast, with the new extensions, barely more than 20 lines of XML code (shown
in Fig. 15) plus the previously collected trace files were required to define the
whole simulation. To validate the new implementation, we also compared the
data produced by the new implementation with that of the initial, completely
Java based implementation. We concluded that the two implementations produced equivalent
results (although the XML based one allowed much more rapid changes to device
configurations and to their behaviour).

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Simulation>
  <Devices>
    <Device>
      <ID>Watch</ID>
      <TraceFileReader>
        <SimulationFilePath>bluetooth_in.csv</SimulationFilePath>
      </TraceFileReader>
    </Device>
    <Device>
      <ID>Smartphone</ID>
      <TraceFileReader>
        <SimulationFilePath>network_out.csv</SimulationFilePath>
      </TraceFileReader>
    </Device>
    <Device>
      <ID>Cloud</ID>
      <TraceFileReader>
        <SimulationFilePath>network_in.csv</SimulationFilePath>
      </TraceFileReader>
    </Device>
  </Devices>
</Simulation>


Fig. 15. XML model of the fitness use case 
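
To illustrate how little information the descriptor of Fig. 15 carries, the following sketch reads it with the standard Java DOM parser and maps each device ID to its trace file. The class DeviceModelReader is a hypothetical example and not the loader used by the extension; only the element names (Device, ID, SimulationFilePath) are taken from Fig. 15.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for descriptors shaped like Fig. 15. */
class DeviceModelReader {
    /** Maps each device ID to its trace file path, in document order. */
    static Map<String, String> readDevices(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> devices = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("Device");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element device = (Element) nodes.item(i);
                String id = device.getElementsByTagName("ID")
                        .item(0).getTextContent().trim();
                String path = device.getElementsByTagName("SimulationFilePath")
                        .item(0).getTextContent().trim();
                devices.put(id, path);
            }
            return devices;
        } catch (Exception e) {
            // Malformed XML or missing elements end up here.
            throw new IllegalArgumentException("invalid device model XML", e);
        }
    }
}

class DeviceModelReaderDemo {
    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<Simulation><Devices><Device><ID>Watch</ID><TraceFileReader>"
                + "<SimulationFilePath>bluetooth_in.csv</SimulationFilePath>"
                + "</TraceFileReader></Device></Devices></Simulation>";
        System.out.println(DeviceModelReader.readDevices(xml));
    }
}
```

A real loader would additionally bind each entry to a previously loaded machine and open the referenced trace file.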


5.3 Evaluation 


To evaluate our extensions, we set up exactly the same situation in the
simulation as we had during the trace collection. We also ensured that the simulation
writes its output, in terms of simulated network and computing activities, in the
same format as the originally collected traces. This allowed easy comparison
between the simulated and the real-life traces. Figure 16 shows the comparison
of the Bluetooth trace. According to the figure, the simulation can accurately
reproduce the real-life traces, i.e., the simulated data transfers occur at the
prescribed times and have the same levels of data movement as the ones recorded in
real life. The network communication between the cloud and the smartphone
showed similar trends (thus the simulation was capable of reproducing the complete
Fig. 13).
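
Since the simulated and the collected traces share one format, such a comparison can be automated. The sketch below is only an illustration: the actual trace format is not described here, so we assume a hypothetical two-column CSV line of the form "timestamp,bytes", and the classes TraceEntry and TraceComparison are our own names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical trace record: when a transfer happened and how much data it moved. */
class TraceEntry {
    final long timestampMs;
    final long bytes;

    TraceEntry(long timestampMs, long bytes) {
        this.timestampMs = timestampMs;
        this.bytes = bytes;
    }

    /** Parses one line of the assumed "timestamp,bytes" format. */
    static TraceEntry parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields: " + line);
        }
        return new TraceEntry(Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()));
    }
}

/** Hypothetical helper that compares a real-life trace with a simulated one. */
class TraceComparison {
    static List<TraceEntry> parseAll(List<String> lines) {
        List<TraceEntry> entries = new ArrayList<>();
        for (String line : lines) {
            entries.add(TraceEntry.parse(line));
        }
        return entries;
    }

    /**
     * Counts the positions at which the two traces differ in time or size.
     * Entries that exist in only one of the traces also count as mismatches.
     */
    static int countMismatches(List<TraceEntry> real, List<TraceEntry> simulated) {
        int common = Math.min(real.size(), simulated.size());
        int mismatches = Math.abs(real.size() - simulated.size());
        for (int i = 0; i < common; i++) {
            TraceEntry r = real.get(i);
            TraceEntry s = simulated.get(i);
            if (r.timestampMs != s.timestampMs || r.bytes != s.bytes) {
                mismatches++;
            }
        }
        return mismatches;
    }
}

class TraceComparisonDemo {
    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("0,100");
        lines.add("1000,250");
        List<TraceEntry> real = TraceComparison.parseAll(lines);
        List<TraceEntry> simulated = TraceComparison.parseAll(lines);
        System.out.println("mismatches: " + TraceComparison.countMismatches(real, simulated));
    }
}
```

A mismatch count of zero corresponds to the case in which the simulated transfers occur at the prescribed times with the recorded amounts of data.
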

6 Conclusion


Distributed systems simulators are not generic enough to be applied in newly
emerging domains, such as IoT Cloud systems, which require in-depth analysis of
the interaction between IoT devices and clouds. Research in this area faces
questions such as how to govern such large cohorts of devices, whose numbers
may easily reach tens of thousands.

In this chapter we investigated various IoT Cloud use cases, and derived a 
general IoT use case. We have shown how generic IoT sensors could be modelled
in the DISSECT-CF simulator, and exemplified how the fundamental properties 
of IoT entities can be represented. Finally, we validated the applicability of the 
introduced IoT extension with a fitness and a meteorological application. 

Abstract. The Internet of Things (IoT) consists of resource-constrained
devices (e.g., sensors and actuators) which form low power and lossy net- 
works to connect to the Internet. With billions of devices deployed in vari- 
ous environments, IoT is one of the main building blocks of future Internet 
of Services (IoS). Limited power, processing, storage and radio dictate 
extremely efficient usage of these resources to achieve high reliability and 
availability in IoS. Denial of Service (DoS) and Distributed DoS (DDoS) 
attacks aim to misuse the resources and cause interruptions, delays, losses 
and degrade the offered services in IoT. DoS attacks are clearly threats 
for availability and reliability of IoT, and thus of IoS. For highly reliable 
and available IoS, such attacks have to be prevented, detected or mitigated 
autonomously. In this study, we present a comprehensive investigation of
Internet of Things security for reliable Internet of Services. We review the 
characteristics of IoT environments, cryptography-based security mech- 
anisms and D/DoS attacks targeting IoT networks. In addition to these, 
we extensively analyze the intrusion detection and mitigation mechanisms 
proposed for IoT and evaluate them from various points of view. Lastly, we 
consider and discuss the open issues yet to be researched for more reliable 
and available IoT and IoS. 


Keywords: IoT · IoT security · IoS · DoS · DDoS ·
Internet of Things · Internet of Services · Reliable IoS


1 Introduction 


The Internet of Things is a network of sensors, actuators, embedded and wearable
devices that can connect to the Internet. Billions of devices are expected to be
part of this network and make houses, buildings, cities and many other deployment
areas smarter [17]. To reach populations in the billions, the elements of
IoT networks are expected to be cheap, small form-factor devices with limited
resources.

IoT is a candidate technology for realizing the future Internet of Services
and the Industry 4.0 revolution. Accommodating billions of devices with sensing
and/or actuation capabilities will introduce crucial problems with management,
interoperability, scalability, reliability, availability and security. Autonomous
control and reliability of the future IoS are directly related to the reliability and
availability of IoT. However, there are serious threats to IoT, which aim to degrade
the performance of the network, deplete the batteries of the devices and cause
packet losses and delays. These attacks are called Denial of Service attacks,
which are already notorious for their effects in existing communication systems.
Limited power, processing, storage and radio dictate extremely efficient usage
of these resources to achieve high reliability and availability in IoS. However,
DoS and DDoS attacks aim to misuse the resources and cause interruptions,
delays and losses, and degrade the offered services in IoT. DoS attacks are clearly
threats to the availability and reliability of IoT, and thus of IoS. For a highly
reliable and available IoS, such attacks have to be prevented, detected or mitigated
autonomously.

© The Author(s) 2018

DoS and DDoS attacks can target any communication system and cause
devastation. Such attacks make use of the vulnerabilities in the protocols, operating
systems, applications and actual physical security of the target system. Readers
can easily find news reports on D/DoS incidents on the Internet.
These attacks are so common that they can be observed every day (see, e.g.,
the digital attack map of Arbor Networks and Google Ideas [3]). It
is not hard to predict that IoT will face D/DoS attacks, either as a target
or as a source of the attacks. In fact, quite recently one of the major Domain Name
System (DNS) infrastructure providers of popular web sites and applications was
the target of DDoS attacks in which a botnet called Mirai compromised
thousands of cameras and digital video recorders [2]. This incident was the
first example of IoT being used as an attack source for DDoS. It clearly showed
that protecting IoT networks from attacks is not sufficient; protecting
the Internet from IoT networks is needed as well.

A very interesting report [53] on how the security of IoT will play an important
role in defining the cybersecurity of the future was published by the UC Berkeley
Center for Long-Term Cybersecurity in 2016. A group of people from various
disciplines developed five scenarios on what security will be like in
the future, considering various dimensions including people, governments,
organizations, companies, society, culture, technological improvements and of course
attackers. Although all of the scenarios are related to the security of IoT, the
last two scenarios are directly related to it. The fourth scenario puts the emphasis
on the ubiquity of IoT: IoT will be everywhere and will play
a vital role in the management of many applications and systems. This
will give attackers more opportunities to find targets. In such a world, attackers
will be able to affect organizations, governments and the daily life of people more
easily than now. Thus, the term cybersecurity will be reduced to just security, since
it will be able to affect everything. The last scenario considers wearable devices and
their novel purposes of use. According to the hypothesis, the wearables of the future
will not only perform basic measurement tasks, but will also be used to track the
emotional states of humans. Advancements in technology will allow such a change.
Emotional, mental and physical state information, which is very important for
individuals, will be the target of attackers and will be used as a weapon against
them. Of course, in such a scenario, it will be crucial for people to manage
their emotional, mental and physical state, and this will affect society in
various ways that we cannot imagine. This report clearly shows that if we fail
to secure IoT networks, then the ubiquity and proliferation of IoT will not
make the future smarter but will have catastrophic effects on human
life, environment, culture and society.

Securing IoT networks is not an easy problem, since we have to consider
device, network and application characteristics, affordable cryptography-based
solutions, physical security of the network and devices, compromise scenarios
and intrusion detection systems. Designers and administrators will face many
trade-offs, where security will be on one side and cost, network lifetime,
Quality-of-Service (QoS), reliability and many more will be on the other side. When
considering all of these dimensions, we should not neglect the user side. We
have to bear in mind that users may not be security-aware. We also have to
take care to propose user-friendly solutions which consider usability
and the user experience. If the solutions in the services that we provide are not
satisfactory, then our efforts will be in vain, making the attackers’ job easier.

The goal of this study is to present researchers with a comprehensive investigation
of IoT security for reliable future IoS. In order to be comprehensive, we searched
the majority of the digital libraries (i.e., IEEE, ACM, Web of Science, SpringerLink
and Google Scholar) for quality conference, journal and magazine papers.
Studies published between 2008 and 2017 were included in this work:
seventeen studies were analyzed to examine the D/DoS attacks on IoT networks,
and twenty-six studies were evaluated which either analyze the effects of the
attacks or propose a mitigation or detection system against such attacks.

The remaining sections of this work are organized as follows. In Sect. 2, we
briefly explain the related works. Section 3 explores the characteristics of IoT
environments with devices, networks and applications. Section 4 considers
Internet of Things security extensively. In Sect. 5, we examine D/DoS attacks for
IoT. Section 6 consists of studies which analyze the effects of the D/DoS attacks
for IoT networks. In Sect. 7, we examine the mitigation systems against D/DoS
attacks, as well as security solutions for specific protocols. Section 8 is on the
intrusion detection systems proposed for IoT, where we analyze several
proposals from various points of view. In Sect. 9, we discuss the open problems and
issues in IoT security and aim to provide new research directions. Finally, Sect. 10
concludes this study.


2 Related Works 


The Internet of Things is one of the most active topics of research nowadays. There
are several surveys which address the security of IoT, attacks, countermeasures 
and Intrusion Detection Systems for IoT. 

Zarpelão et al. [60] proposed a taxonomy of IDSes based on the placement
approaches, detection methods and validation strategies. In their work, the authors
point out that IoT has unique characteristics, which will bring unique threats and
novel requirements for IDSes. According to their findings, IDSes proposed for IoT
need to address more attacks, more communication technologies and more
protocols. They also indicated that IDS traffic should be managed securely and IDS
designs should pay attention to the privacy of the host.

Adat et al. [5] presented a literature review on the security of IoT in which the
history of IoT security, a taxonomy of security challenges and requirements,
cryptography-based defense mechanisms and IDSes were evaluated. The authors
suggested that readers research lightweight authentication schemes, target 6LoWPAN
and RPL security and consider the resource limitations of IoT devices.

Samaila et al. [46] presented an extensive analysis of security challenges of
IoT. In this study the authors considered several issues including implementa- 
tion of security in IoT, resource limitations, heterogeneity of IoT environments, 
applications and devices, security awareness of the users and maintenance of 
security after deployment. 

Yang et al. [58] studied security and privacy issues in IoT. Their work consid- 
ered the limitations of IoT environments which affect the security and privacy. 
They provided a classification of the attacks based on the layers of an IoT archi- 
tecture and analyzed the cryptography-based security solutions for IoT networks 
in depth. 

In this study, we aimed to provide a comprehensive view on security of IoT 
for reliable IoS. Although there are some topics of interest and points of view 
in common with the previous reviews, our work tries to depict a more complete 
picture of security of IoT. 


3 Internet of Things 


Internet of Things can be defined in several ways from various angles and there
is no standard definition for it. However, from the engineering point of view, IoT
is a network of things, each of which is supplied with a computing system (i.e., CPU,
memory, power source and a communication interface like radio or Ethernet),
is uniquely identifiable and addressable, and is connected to the Internet.
In this section, we first propose a generic architecture for IoT, which we
think will be helpful to understand IoT environments better. After that, we
summarize the standardized protocol stack [37] we focus on in this study.


3.1 Internet of Things Architecture 


We believe that exploring the architectural components is a very useful way to
see the complete picture and understand IoT environments better. In Fig. 1, we
outline a generic IoT architecture which is based on the general architectures
previously proposed in [24,57,59]. The only difference between our architecture and
the reference works is that we separate the IoT Access Network Layer from
the IoT-Internet Connection Layer, whereas the reference studies combine them
into a single layer called either the Network Layer or the Transport Layer.

Business Layer
Application Layer (Smart-Home, Smart-City, Healthcare, Industry, ...)
Processing Layer (Centralized Database, Cloud, Fog, Services, ...)
IoT-Internet Connection Layer (Fiber optics, Satellite, ...)
IoT Access Network Layer (6LoWPAN, BLE, Cellular (NB-IoT, LTE-M), LoRa,
Thread, Ethernet, WiFi, ZigBee, RF, Power Line, ...)
Perception Layer (sensors, actuators, RFID tags, embedded devices, ...)

Fig. 1. Generic architecture of IoT


In the generic architecture, the lowest layer is the Perception Layer. It
consists of sensors, actuators, RFID tags and any other embedded devices. Most
of these devices are expected to be small form-factor devices with constrained
resources (i.e., power source, processing, storage and communication interface).
The majority of IoT devices will use a battery as the power source. However, depending
on the application environment, mains-powered devices or energy-harvesting
elements may exist as well. Since power will be a scarce resource, the power
consumption of the nodes (i.e., devices in the network) has to be minimized. In
addition to various techniques to reduce the power consumption, IoT devices use low-power
radios to keep the energy footprint as small as possible and lengthen the network
lifetime. Typically, low-end microcontrollers with RAM and ROM on the order
of KBs constitute the major portion of the nodes accommodated in IoT networks. In
addition to the resource characteristics, the mobility of the devices is important as
well. Devices in the Perception Layer can be either static or mobile, but the
percentage of mobile devices will be smaller than that of static ones.

The IoT Access Network Layer is the second layer in our architecture, in 
which the nodes in the Perception Layer form a network. In this layer, there 
are several communication technologies (e.g., 6LoWPAN, Bluetooth Low Energy 
(BLE), LoRa and LoRaWAN, WiFi, Ethernet, Cellular, ZigBee, RF and Thread) 
which are candidates for the in-network communication. Most of them are open 
technologies, whereas some (e.g., ZigBee, LoRa, Cellular) are proprietary. 
These communication technologies provide varying data rates and transmission 
ranges in return for different power consumptions and costs. Hence, depending 
on several design constraints, the nodes in the Perception Layer can form IoT 
networks with different characteristics. Among these technologies, BLE, WiFi, 
LoRa and Cellular offer star-based topologies. However, 6LoWPAN, ZigBee and 
Thread support mesh topologies, where elements of the network can forward 
others’ packets. Some of them are proposed for specific application areas 
(e.g., Thread was proposed for smart-home environments). Most of these 
technologies require a gateway or border router, which is used to connect the 
nodes in the IoT network to the Internet. 

The third layer in our generic architecture is the IoT - Internet Connection 
Layer, where a border router or gateway connects the inner IoT network to 
the Internet via communication technologies, such as fiber optics or satellite 
communication. 

Processing, analysis and storage of the collected data are performed at the 
Processing Layer. Designers can choose centralized storage and processing sys- 
tems, or distributed storage and processing systems (e.g., cloud or fog computing 
environments). Middleware services are provided in this layer based on the 
processed and analyzed data. This is one of the most important layers in the 
architecture of IoT, since valuable information is extracted here from the 
collected data, which can be demanding in terms of volume, variety and veracity. 

The Application Layer is the fifth layer within the generic IoT architecture. 
In this layer, we see applications in various deployment areas, which make use of 
the meaningful information obtained from the Processing Layer. Applications of 
IoT can be in home, building, industry, urban or rural environments. Applications 
in home environments include health reporting and monitoring, alarm systems, 
lighting applications, energy conservation and remote video surveillance [13]. 
IoT applications in building environments include Heating, Ventilation and Air 
Conditioning (HVAC) applications, lighting, security and alarm systems, smoke 
and fire monitoring and elevator applications [31]. Industrial IoT applications 
include safety, control and monitoring applications with different emergency 
classes [38]. In urban environments, there may be a broad range of applications: 
lighting applications, waste monitoring, intelligent transportation system 
applications, monitoring and alert reporting are only a few of them. Rural 
environments may include monitoring applications (e.g., for bridges, forests 
and agriculture). 

The Business Layer is the last layer in the generic architecture, which includes 
organization and management of IoT networks. Business and profit models are 
constructed here in addition to charging and management operations [57]. 


3.2 Standardized Protocol Stack for Low Power and Lossy 
Networks 


Multiple communication technologies exist for the IoT Access Network Layer, as 
we mentioned in Sect. 3.1. Since Thread, NB-IoT, LTE-M and LoRa/LoRaWAN are 
very new communication technologies, there were no studies focusing on the 
D/DoS attacks that may target such networks at the time we were working on 
this proposal. ZigBee is a proprietary technology, and it uses the same 
physical and MAC layers as 6LoWPAN-based networks. Thus the PHY and MAC layer 
attacks for 6LoWPAN-based networks cover the PHY and MAC layer attacks for 
ZigBee-based networks too. WiFi targets more resource-rich devices than 
6LoWPAN. Therefore, it may not be a good candidate for low power and lossy 
network-based IoT applications, where the majority of the devices will be 
battery-powered devices with small form factors and reasonable costs. Bluetooth 
Low Energy might be a good option for the IoT Access Network Layer, with very 
low power consumption and increased data rate and range. However, it suffers 
from a scalability problem: Bluetooth-based networks face issues when the 
number of slaves exceeds seven [23,25]. Up until Bluetooth 5, Bluetooth 
supported the park state, which allowed more than seven slaves to take part in 
the Bluetooth network in turns. Bluetooth 5 no longer supports it and instead 
brought scatternets, which aim to create multi-hop Bluetooth networks with 
specific nodes acting as routers between piconets. However, currently no 
commercial Bluetooth radio supports them, and synchronization and routing 
operations will make scatternet operation in Bluetooth networks a complex issue 
to deal with. Hence, for the aforementioned reasons, we focus on 6LoWPAN-based 
IoT networks in this study. 

IEEE and IETF proposed several standards and protocols in order to connect 
resource-constrained nodes to the Internet within the concept of IoT. Palattella 
et al. [37] proposed a protocol stack for low power and lossy IoT networks which 
makes use of the protocols/standards proposed by IEEE and IETF. The stan- 
dardized protocol stack is shown in Fig. 2. 

The standardized protocol stack includes IEEE 802.15.4 [1] for the physical 
and MAC layers. This standard promises energy-efficient PHY and MAC operations 
for low power and lossy networks and is also used by the Thread and ZigBee 
technologies. 

Fig. 2. Standardized protocol stack 


The expected cardinality of IoT networks (e.g., on the order of billions) 
and the already exhausted IPv4 address space force IoT to use IPv6 addresses. 
However, when IPv6 was proposed, low power and lossy networks were not 
considered, which resulted in an incompatible packet size issue. The maximum 
transmission unit of IEEE 802.15.4-based networks is far too small compared to 
IPv6 packet sizes. In order to solve this problem, the IETF proposed an 
adaptation layer, IPv6 over Low Power Wireless Personal Area Networks (6LoWPAN) 
[18]. 6LoWPAN uses header compression and fragmentation so that IPv6 packets 
can be carried in IEEE 802.15.4 frames. 
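
To give a feel for the size mismatch, the following sketch counts the link-layer fragments needed for one IPv6 packet. It relies on figures that are not stated in this chapter and are assumed here from the 6LoWPAN fragmentation format: a 127-byte IEEE 802.15.4 frame with up to 25 bytes of MAC overhead (leaving 102 bytes), a 4-byte first-fragment header, a 5-byte header on later fragments, and fragment offsets counted in 8-byte units. Header compression and dispatch bytes are ignored, and the function name is illustrative.

```python
import math

FRAME_PAYLOAD = 102   # assumed: 127-byte frame minus 25 bytes of MAC overhead
FRAG1_HEADER = 4      # assumed: size of the first-fragment header
FRAGN_HEADER = 5      # assumed: size of the header on later fragments


def fragments_needed(datagram_len, frame_payload=FRAME_PAYLOAD):
    """Approximate number of 802.15.4 frames for one uncompressed datagram."""
    if datagram_len <= frame_payload:
        return 1  # fits in a single frame, no fragmentation
    # Every fragment except the last carries a multiple of 8 bytes.
    first = (frame_payload - FRAG1_HEADER) // 8 * 8
    later = (frame_payload - FRAGN_HEADER) // 8 * 8
    remaining = datagram_len - first
    return 1 + math.ceil(remaining / later)


print(fragments_needed(1280))  # 1280 bytes is the IPv6 minimum MTU -> 14
```

Under these assumptions a single minimum-MTU IPv6 packet is spread over more than a dozen frames, which is why losing or corrupting one fragment is costly for constrained nodes.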

RPL [56] was proposed by the IETF as the IPv6 routing protocol for low power 
and lossy networks. The formation of IEEE 802.15.4-based mesh networks was made 
possible by the RPL routing protocol, which constructs Destination Oriented 
Directed Acyclic Graphs (DODAGs). A DODAG root creates a new RPL instance and 
lets other nodes join the network by means of control messages. There are four 
types of control messages: DODAG Information Solicitation (DIS), DODAG 
Information Object (DIO), Destination Advertisement Object (DAO) and DAO 
Acknowledgment (DAO-ACK) messages. DIS messages are broadcast by new nodes to 
obtain information about the RPL instance in order to join the network. 
Neighbor nodes reply with DIO messages, which carry information about the RPL 
network (e.g., DODAG ID, instance ID, rank, version number and mode of 
operation) and their position in the network. The position of a node, which is 
its relative distance from the DODAG root, is called its rank. Rank is carried 
in DIO messages and is calculated by each node based on the Objective Function 
(OF) and the rank of its neighbor nodes. OF types include, but are not limited 
to, hop count, expected transmission count and remaining energy. RPL lets 
network administrators select a suitable OF based on the QoS requirements. 
When a node receives DIO messages from its neighbors, it calculates its rank 
and informs its neighbors about its rank with a new DIO message. Based on the 
ranks of its neighbors, it selects the one with the lowest rank value as its 
preferred parent and informs that node with a DAO message. The receiving node 
replies with a DAO-ACK message, and thus a parent-child relationship is set up. 
An example RPL network is shown in Fig. 3. 
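
The parent selection step described above can be sketched as follows. This is a simplified illustration, not the RPL specification: it assumes a hop-count style Objective Function in which a node's rank is its parent's rank plus a fixed step, and the constants (a step of 256 and an "infinite" rank of 0xFFFF) as well as the function and node names are assumptions made for the example.

```python
MIN_HOP_RANK_INCREASE = 256   # assumed rank step per hop
INFINITE_RANK = 0xFFFF        # assumed marker for "not reachable"


def choose_parent(dio_ranks):
    """Pick a preferred parent from the ranks heard in DIO messages.

    dio_ranks maps a neighbor identifier to the rank it advertised.
    Returns (preferred_parent, own_rank).
    """
    candidates = {n: r for n, r in dio_ranks.items() if r < INFINITE_RANK}
    if not candidates:
        return None, INFINITE_RANK
    # The neighbor with the lowest rank is closest to the DODAG root.
    parent = min(candidates, key=candidates.get)
    return parent, candidates[parent] + MIN_HOP_RANK_INCREASE


parent, rank = choose_parent({"A": 512, "B": 256, "C": 768})
print(parent, rank)  # B 512
```

Because the choice depends only on advertised ranks, a node that advertises a rank it does not deserve can influence its neighbors' decisions, which is the basis of several attacks discussed later in the chapter.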

Fig. 3. An example RPL network (legend: DAO messages, DIO messages) 


In RPL, upward routes (i.e., the routes towards the DODAG root) are created 
by means of DIO messages, whereas downward routes are created by DAO messages. 
In order to minimize the overhead of control messages, RPL uses the Trickle 
Timer [30], which reduces the number of control messages as the network becomes 
more stable. Nodes are expected to follow the rules of the RPL specification in 
order to create loop-free and efficient RPL DODAGs. In low power and lossy 
networks, faults and problems tend to occur. To recover from such issues, RPL 
provides repair mechanisms (namely, global repair and local repair). 
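
The effect of the Trickle Timer can be illustrated with a small sketch. It follows the commonly described Trickle rules (the interval doubles while the network is consistent, transmissions are suppressed when enough consistent messages were heard, and an inconsistency resets the interval to its minimum). The class name, method names and default parameter values are assumptions chosen for illustration only.

```python
import random


class TrickleTimer:
    def __init__(self, i_min=1.0, doublings=4, k=1, rng=None):
        self.i_min = i_min                      # smallest interval (seconds)
        self.i_max = i_min * 2 ** doublings     # largest interval
        self.k = k                              # redundancy constant
        self.rng = rng or random.Random()
        self.interval = i_min
        self._start_interval()

    def _start_interval(self):
        self.counter = 0
        # Transmission time t is picked in the second half of the interval.
        self.t = self.rng.uniform(self.interval / 2, self.interval)

    def hear_consistent(self):
        self.counter += 1

    def should_transmit(self):
        # Evaluated at time t: stay silent if enough neighbors already spoke.
        return self.counter < self.k

    def interval_expired(self):
        # Stable network: double the interval, up to the maximum.
        self.interval = min(self.interval * 2, self.i_max)
        self._start_interval()

    def hear_inconsistent(self):
        # Something changed: fall back to the fastest rate.
        if self.interval > self.i_min:
            self.interval = self.i_min
            self._start_interval()


timer = TrickleTimer()
intervals = []
for _ in range(6):
    intervals.append(timer.interval)
    timer.interval_expired()
print(intervals)  # [1.0, 2.0, 4.0, 8.0, 16.0, 16.0]
```

The reset in `hear_inconsistent` is what keeps control traffic low in a stable network, and it is also the lever an attacker pulls when it repeatedly triggers resets to make neighbors send control messages at the fastest rate.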

The standardized protocol stack for low power and lossy networks employs 
the Constrained Application Protocol (CoAP) [50] for the application layer. 
CoAP is built on top of UDP and supports the Representational State Transfer 
(REST) architecture. By means of CoAP, even resource-constrained nodes can 
be part of the World Wide Web (WWW). In order to optimize the data carried 
by CoAP messages, the IETF proposed another standard for the binary 
representation of structured data, called Concise Binary Object Representation 
(CBOR) [12], for use on top of CoAP. 
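
To illustrate why a binary representation helps on constrained links, the sketch below hand-encodes a small sensor reading in CBOR and compares its size with the JSON text form. It is a deliberately tiny encoder that covers only non-negative integers, text strings and maps; the sample reading and the function names are made up for the example, and a real deployment would use a complete CBOR library.

```python
import json
import struct


def _head(major, value):
    """Encode the CBOR initial byte and argument for a major type."""
    if value < 24:
        return bytes([major << 5 | value])
    if value < 2 ** 8:
        return bytes([major << 5 | 24, value])
    if value < 2 ** 16:
        return bytes([major << 5 | 25]) + struct.pack(">H", value)
    if value < 2 ** 32:
        return bytes([major << 5 | 26]) + struct.pack(">I", value)
    return bytes([major << 5 | 27]) + struct.pack(">Q", value)


def cbor_encode(obj):
    """Tiny encoder: non-negative ints, text strings and dicts only."""
    if isinstance(obj, int) and not isinstance(obj, bool) and obj >= 0:
        return _head(0, obj)                      # major type 0: unsigned int
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _head(3, len(data)) + data         # major type 3: text string
    if isinstance(obj, dict):
        out = _head(5, len(obj))                  # major type 5: map
        for key, value in obj.items():
            out += cbor_encode(key) + cbor_encode(value)
        return out
    raise TypeError("type not covered by this sketch")


reading = {"t": 21, "id": 1000}                   # hypothetical sensor reading
encoded = cbor_encode(reading)
print(encoded.hex())                              # a26174156269641903e8
print(len(encoded), "bytes as CBOR,", len(json.dumps(reading)), "as JSON")
```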


4 Internet of Things Security 


In Sect. 3.2, we briefly summarized the standardized protocol stack which con- 
sists of standards and protocols proposed by IEEE and IETF for low power and 
lossy networks. In this section, we focus on the security of IoT networks which 
accommodate the standardized protocol stack. 

Securing a communication network is not an easy task and requires a 
comprehensive approach. In such a study, we have to determine assets, think of 
threats and consider compromise scenarios and possible vulnerabilities. 
Following these, we have to find suitable solutions which will help us to 
ensure a secure system. When we think of solutions, the first thing that 
probably comes to mind is cryptography. Cryptography promises to provide 
confidentiality and integrity of messages, authentication of users and systems 
and non-repudiation of transactions. Confidentiality means that the content of 
the message is kept secret from eavesdroppers. Integrity ensures that the 
content of the message has not been changed since it was produced. 
Authentication allows the communicating end points to identify each other and 
determine the correct target of the communication. Non-repudiation prevents 
one party from denying the actions it has performed, and thereby protects the 
other party. 
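
As a concrete illustration of two of these properties, the sketch below uses a keyed hash (HMAC-SHA256 from the Python standard library) so that a receiver holding the same key can detect modified or forged messages. This is only an illustration of integrity and authentication: it provides no confidentiality, and because both sides share the key it provides no non-repudiation either. The protocols discussed in this section use their own constructions, and the function names here are assumptions.

```python
import hashlib
import hmac

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def protect(key, message):
    """Append an authentication tag computed over the message."""
    return message + hmac.new(key, message, hashlib.sha256).digest()


def verify(key, blob):
    """Return the message if its tag is valid, otherwise None."""
    message, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return message if hmac.compare_digest(tag, expected) else None


key = b"k" * 16                        # hypothetical pre-shared key
blob = protect(key, b"temp=21")
print(verify(key, blob))               # b'temp=21'
print(verify(key, b"temp=99" + blob[7:]))  # None: content was modified
```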

In this section, we first outline the cryptography-based security solutions for 
the low power and lossy networks which employ the standardized protocol stack. 
After that, we analyze the protocols and point out their advantages and 
disadvantages. Then we ask whether cryptography alone is sufficient. 


4.1 Cryptography-Based Security Solutions for Low Power 
and Lossy Networks 


A number of cryptography-based solutions exist to secure the low power 
and lossy networks that employ the standardized protocol stack. These solutions 
are shown in Fig. 4. 


Fig. 4. Cryptography-based security solutions for low power and lossy networks 


IEEE 802.15.4 PHY and Link Layer Security [1] provides security for the 
communication between two neighbors in IEEE 802.15.4-based networks. This 
hop-by-hop security solution promises confidentiality, authenticity and integrity 
against insider attackers. 

The Internet Protocol Security (IPSec) [22] aims to provide end-to-end secu- 
rity. It consists of a set of protocols, which are Authentication Headers (AH), 
Encapsulating Security Payloads (ESP) and Security Associations (SA). AH 
provides authentication and integrity, whereas ESP promises confidentiality in 
addition to authentication and integrity. Designers can select either of them, 
but regardless of the selection, an SA has to be established first to set up 
the security parameters. IPSec provides security for IP-based protocols and is 
independent of the protocols above the network layer. 

In addition to IPSec, RPL provides secure versions of its control messages. 
Although their use is optional, they provide confidentiality, integrity and 
authentication of the control messages. 

Datagram Transport Layer Security (DTLS) [44] aims to secure UDP- 
based applications. Similar to the other solutions, it ensures the confidentiality, 
integrity and authenticity of datagrams. 

CoAP provides security bindings for DTLS in the coaps scheme. It lets designers 
choose to run DTLS with preshared keys, public keys and/or certificates in 
order to secure CoAP traffic. Although Fig. 4 does not show any other security 
mechanisms working at the application layer and above, the IETF has draft 
documents (i.e., Object Security of CoAP (OSCoAP), CBOR Object Signing 
and Encryption (COSE) and Ephemeral Diffie-Hellman over COSE (EDHOC)) 
which aim to provide security at the application layer and above. 


4.2 Which Security Solution to Use? 


As we can see, there are a number of security solutions to protect low power and 
lossy networks and it is hard to determine which solution to use. 

IEEE 802.15.4 PHY and Link Layer Security is independent of the network 
layer protocols, and most radios support it. Independence from the upper layer 
protocols means that we do not have to change anything in them. However, since 
IEEE 802.15.4 PHY and Link Layer Security provides security only between two 
neighbors, the trustworthiness of every node on the routing path becomes a 
crucial issue. If the routing path contains a malicious node, then the security 
of the messages routed through this path cannot be guaranteed. In addition, 
IEEE 802.15.4 PHY and Link Layer Security works only in the IoT Access Network 
Layer of our generic architecture, and when messages leave this layer and enter 
the Internet, they are no longer protected [40]. 

IPSec provides end-to-end security and is independent of the upper layer 
protocols. End-to-end security guarantees security between two hosts which can 
be in different networks. Designers do not have to worry about the 
trustworthiness of the other nodes, devices or networks on the path. However, 
it places a burden on the 6LoWPAN layer, where packets with IPSec require 
header compression [40]. In addition, Security Associations are 
connection-oriented and simplex, which means that if two hosts want to send 
packets secured with IPSec to each other, then each of them needs to establish 
its own SA [26]. Furthermore, firewalls may limit packets with IPSec, and 
Internet Service Providers (ISPs) tend to treat packets with IPSec as 
business-class packets and charge more for them. So, if IoT data is to be 
secured with IPSec, there are a number of issues we have to consider before 
using it. 

DTLS serves as the security solution between two UDP-based applications 
running on different end-points. It aims to protect the application layer data 
and does not promise security for anything else. This means that if we employ 
DTLS as the only security solution, we cannot protect IP headers when packets 
pass through the IoT Access Network Layer and through the Internet. Routing 
therefore remains susceptible to attacks such as DoS and DDoS attacks. This is 
why D/DoS attacks are the primary security concern with DTLS. 


4.3 We Have Cryptography-Based Solutions, Are We All Set? 


In Sect. 4.1 we outlined the cryptography-based security solutions very briefly. 
As we explained in Sect. 4.2, each solution comes with its advantages and dis- 
advantages. It is not an easy task to select the appropriate solution. However, 
there are a number of other issues which we have to consider when protecting 
IoT networks. 

First of all, cryptography is generally thought to be heavyweight and is full 
of resource-consuming operations implemented in software and/or hardware. 
When we consider the resource limitations of the devices in low power and lossy 
networks, the affordability of such solutions becomes questionable. Designers 
have to face the trade-off between security and crucial parameters such as 
cost, network lifetime and performance. 

Secondly, although cryptography-based solutions are proven to be secure, 
proper implementation of the protocols and algorithms is extremely important. 
However, most implementations of these solutions have vulnerabilities, as 
reported by researchers [8]. In addition, in order to shorten development 
time, engineers tend to use code examples shared on forums. The fact that these 
code examples work properly does not mean that they are vulnerability-free [4]. 

Physical security of the networks and devices is as important as our other 
concerns. It is directly related to the types of attacks that are applicable. 
If the physical security of the deployment area is weak, which is the case for 
most deployments, and if devices lack tamper protection mechanisms in order to 
reduce cost, then attackers can insert a malicious device, or grab a device, 
extract its security parameters and leave a malicious device in its place. 

In addition to the cost of cryptography, issues with correct implementation 
and physical security, we have to consider users as well. We know that most 
people are not security-aware and that security mechanisms have usability 
problems [47]. Therefore, compromise scenarios have to take into account the 
users and external people involved with the IoT network, applications and 
deployment areas. 

Although we have cryptography, our networks and systems are still susceptible 
to a class of attacks called Denial of Service attacks [54]. In the next 
section, we will examine the DoS and DDoS attacks which may target low power 
and lossy networks employing the standardized protocol stack. 


5 Denial of Service Attacks Targeting Internet of Things 
Networks 


Denial of Service attacks aim to misuse the available resources in a 
communication network and degrade or stop the services offered to ordinary 
users. Since IoT will be one of the main building blocks of the Internet of 
Services, detection, mitigation and prevention of such attacks are crucial. 


Table 1. D/DoS attacks which may target IoT networks 

Physical layer | MAC layer | 6LoWPAN layer | Network layer | Transport and application layer 
Node Capture | Jamming | Fragment Dupl. | Rank | Flooding 
 | Replay Protection | | | Caching 
 | | | Sybil | Risk of 
 | | | Sinkhole | Cross-Protocol 

In this section, we present and explain the D/DoS attacks which may target 
IoT networks. Table 1 categorizes such attacks with respect to the layers of the 
standardized protocol stack. This categorization is an extended version of our 
previous study [10]. 


5.1 D/DoS Attacks to the Physical Layer 


Physical Layer D/DoS attacks are node capture, jamming and spamming. 

As its name implies, in node capture attacks, attackers capture physical 
nodes within the network. The aim of the attackers may be to create routing 
holes or to tamper with the device and extract security parameters. After 
that, they may put the node back with compromised software or replace it with 
a replica. In this way, they can apply various attacks (e.g., other attacks 
categorized as higher layer attacks). 

Physical Layer jamming attacks consist of malicious devices creating 
interference with the signals transmitted in the physical layer [7]. Attackers 
can apply jamming constantly, randomly or selectively (i.e., jamming signals 
carrying specific packets, such as routing or data packets). 

In spamming attacks, attackers place malicious QR codes in deployment areas, 
which cause users to be redirected to malicious targets on the Internet [42]. 


5.2 D/DoS Attacks to the MAC Layer 


MAC Layer D/DoS attacks are link layer jamming, GTS, backoff manipulation, 
CCA manipulation, same nonce attack, node specific flooding, replay protection 
attack, ACK attack, man-in-the-middle, ping-pong effect, bootstrapping attack, 
PANID conflict and steganography. 

Link layer jamming is a type of jamming where frames are jammed instead 
of signals as in the physical layer [7]. 

The IEEE 802.15.4 standard has an optional feature called Guaranteed Time 
Slot (GTS), which works in the beacon-enabled operational mode. GTS is intended 
for time-critical applications that require strict timing of channel access 
and transmissions. Nodes have to request and be allocated time slots in order 
to use this feature. However, if attackers cause interference during this 
process (e.g., by jamming), then ordinary nodes cannot register themselves for 
the guaranteed time slots, and thus the QoS of the application is affected. 
This attack is called the GTS attack [51]. 

The ACK attack consists of attackers interfering with Acknowledgment 
(ACK) frames and thus causing a node to believe that its fragment was not 
successfully received by the receiving node [7]. In this way, targeted nodes 
are forced to retransmit the same fragment and consume more power. The QoS of 
the running application would be affected as well. Moreover, it may cause the 
sender node to believe that its next hop neighbor is filtering the messages. 

The Clear Channel Assessment (CCA) mechanism is used by nodes to sense the 
channel and find out whether any other node is currently using it. 
This approach is commonly used to prevent collisions. However, attackers can 
skip CCA and access the channel, which causes collisions. In this way, delays, 
retransmissions and unnecessary energy usage occur. This attack is called CCA 
manipulation [7]. 

The backoff manipulation attack compromises the backoff periods of Carrier 
Sense Multiple Access (CSMA)-based medium access, with attackers choosing 
shorter backoff times instead of longer ones [7]. In this way, they get the 
chance to use the channel as much as possible and limit the other users’ 
channel access. 
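
The unfairness this creates can be seen in a toy contention model. The sketch below is not a model of the IEEE 802.15.4 CSMA-CA procedure: it simply assumes that in each round every honest node draws a random backoff from a fixed window, the attacker always picks the shortest possible backoff, and the node with the unique smallest value gets the channel. The window size, node count and function name are illustrative assumptions.

```python
import random


def contention_shares(rounds=10_000, honest=3, honest_window=8, seed=1):
    """Fraction of rounds won by the attacker, by honest nodes, or lost
    to a collision (two or more nodes sharing the smallest backoff)."""
    rng = random.Random(seed)
    wins = {"attacker": 0, "honest": 0, "collision": 0}
    for _ in range(rounds):
        attacker = 0  # assumed misbehavior: always use the shortest backoff
        draws = [rng.randrange(honest_window) for _ in range(honest)]
        best = min(draws + [attacker])
        if draws.count(best) + (attacker == best) > 1:
            wins["collision"] += 1
        elif attacker == best:
            wins["attacker"] += 1
        else:
            wins["honest"] += 1
    return {name: count / rounds for name, count in wins.items()}


print(contention_shares())
```

In this model the honest nodes never win a round: either the attacker transmits alone or an honest node that also drew zero collides with it, which wastes energy on both sides.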

Sequence numbers are used in the IEEE 802.15.4 standard in order to prevent 
malicious devices from resending previously sent fragments over and over. 
However, in replay protection attacks [7], attackers can still misuse this 
mechanism by sending frames with a larger sequence number than that of the 
targeted ordinary node. This causes the receiving node to drop the fragments 
coming from the ordinary node, since it now appears to be sending old fragments. 
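
A minimal sketch of this mechanism and its abuse is given below. It assumes a receiver that keeps the highest counter seen per claimed source address and accepts only strictly newer values; the class name, addresses and counter values are hypothetical, and counter wrap-around is ignored.

```python
class ReplayFilter:
    """Accept a frame only if its counter is newer than the last one
    seen for the claimed source address."""

    def __init__(self):
        self.last_seen = {}

    def accept(self, source, counter):
        if counter <= self.last_seen.get(source, -1):
            return False          # looks like a replay: drop it
        self.last_seen[source] = counter
        return True


rx = ReplayFilter()
print(rx.accept("node-7", 10))    # True: genuine frame
print(rx.accept("node-7", 1000))  # True: forged frame with a large counter
print(rx.accept("node-7", 11))    # False: genuine next frame now looks stale
```

The filter behaves exactly as designed in all three calls; the denial of service comes from the fact that the forged frame was accepted as belonging to the victim in the first place.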

As we mentioned in Sect. 4, IEEE 802.15.4 PHY and Link Layer Security 
is a candidate security mechanism for IoT security, which promises to protect 
the communication between two neighbor nodes. If nodes share the same key 
and nonce values in the implementation of IEEE 802.15.4 PHY and Link Layer 
Security, then attackers may extract the keys by eavesdropping on the messages. 
This is called the same nonce attack [7]. 

The PANID Conflict attack misuses the conflict resolution procedure of IEEE 
802.15.4, which is triggered when two coordinators holding the same Personal 
Area Network ID (PANID) are placed close to each other in a deployment area. 
Malicious nodes may transmit a conflict notification message when there is 
actually no conflict, forcing the coordinator to initiate the conflict 
resolution process [7]. 

Another MAC Layer D/DoS attack is the ping-pong effect, where malicious 
nodes intentionally switch between different PANs [7]. If attackers do so 
frequently, they may cause packet losses, delays and extra load on the already 
limited resources. 

In the bootstrapping attack, attackers aim to obtain useful information about 
a new node joining the network. In order to do so, a targeted node is first 
forced to leave the network by the attackers. Then, when it tries to join the 
network again, attackers obtain the bootstrapping information, which they may 
use to associate a malicious node with the network [7]. 

Node specific flooding attacks are a type of flooding attack applied at the 
MAC layer [7]. In this attack, malicious nodes send unnecessary fragments to 
the target node in order to consume its resources, so that it is no longer 
able to serve its ordinary purpose. 

The Steganography attack abuses the unused fields in the frame format of IEEE 
802.15.4. Unused bits can be used by the attackers to carry hidden information [7]. 


5.3 D/DoS Attacks to the 6LoWPAN Layer 


Hummen et al. [19] proposed two attacks, namely fragment duplication and buffer 
reservation, which may target the 6LoWPAN Adaptation Layer. 

In the fragment duplication attack, attackers duplicate a single fragment of 
a packet and thus force the receiving node to drop the fragments of the 
corresponding packet. In this attack, attackers abuse the way the 6LoWPAN 
standard deals with duplicate fragments. The standard advises dropping the 
fragments of a packet in case of duplicates, so as to avoid the overhead of 
dealing with duplicates and save resources. However, malicious nodes can turn 
this naive mechanism into a DoS attack very easily. 
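
The drop-on-duplicate behavior described above can be sketched with a toy reassembly buffer. The policy implemented here (discard everything collected for a packet as soon as a fragment offset is seen twice) is an assumption that follows the description in the text; the class name, the tag values and the fragment layout are illustrative.

```python
class Reassembler:
    """Collects fragments per datagram tag. Seeing the same offset twice
    discards everything received so far for that datagram."""

    def __init__(self):
        self.partial = {}

    def receive(self, tag, offset, data, total_len):
        frags = self.partial.setdefault(tag, {})
        if offset in frags:
            del self.partial[tag]      # duplicate: drop the whole datagram
            return None
        frags[offset] = data
        if sum(len(d) for d in frags.values()) == total_len:
            del self.partial[tag]
            return b"".join(frags[o] for o in sorted(frags))
        return None


r = Reassembler()
r.receive(1, 0, b"AAAA", 8)
print(r.receive(1, 4, b"BBBB", 8))     # b'AAAABBBB': normal reassembly

r.receive(2, 0, b"AAAA", 8)
r.receive(2, 0, b"XXXX", 8)            # attacker injects a duplicate
print(r.receive(2, 4, b"BBBB", 8))     # None: the packet can no longer complete
```

A single injected fragment is enough to make the receiver throw away a packet that may have taken many frames to deliver.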

In buffer reservation attacks, attackers reserve the buffer space of the 
targeted node with incomplete packets and keep it occupied as long as possible. 
Since resources are limited, nodes cannot afford to spare extra buffer space for 
the incomplete packets of other nodes. Thus, while the attacker holds the 
buffer space, ordinary nodes’ fragments cannot be accepted. Readers should 
note that this behavior is possible when 6LoWPAN is configured to forward 
fragments according to the route-over approach, where all fragments of a packet 
are reassembled by the receiving node before being forwarded. 


5.4 D/DoS Attacks to the Network Layer 


D/DoS attacks which may target the IoT Network Layer can be divided into 
two categories: RPL-specific and non-RPL-specific attacks. RPL-specific attacks 
are rank, version number, local repair, DODAG inconsistency and DIS attacks. 
Non-RPL-specific attacks are the ones which are already known from research on 
wireless sensor networks and other communication networks. Although they 
look old-fashioned, they are still applicable in RPL-based networks. 
Non-RPL-specific attacks are sybil, sinkhole, selective forwarding, wormhole, 
cloneID and neighbor attacks. 


RPL-Specific Attacks. D/DoS attacks which may target RPL networks abuse 
the vulnerabilities of the RPL protocol design. RPL, designed by the IETF for 
the routing of IPv6 packets on low power and lossy networks, has vulnerabilities 
in its control plane security which attackers can easily exploit. In order to 
secure RPL networks, the IETF advises using cryptography-based security 
solutions, secure control messages and some attack-specific countermeasures 
(e.g., using location information, multi-path routing) [52]. However, as 
explained in Sect. 4.3, there are several issues to consider with security, and 
it is highly probable that RPL-based networks will be susceptible to D/DoS 
attacks. 

Rank is a crucial parameter of the RPL protocol which represents a 
node’s position within the DODAG. This position is the relative distance of a 
node from the DODAG root. The distance is determined with respect to the 
Objective Function and can be based on the hop count, link quality, remaining 
power, etc. Rank is used to create an efficient DODAG according to the 
application needs and to set up the child-parent relationship. For optimized 
and loop-free DODAGs, nodes have to follow the rules. However, malicious nodes 
may use rank in various ways to apply D/DoS attacks. In [27] and [28], an 
attacker node selects the neighbor with the worst rank as its preferred parent 
instead of choosing the one with the best rank. Thus an inefficient DODAG is 
created, which causes delays and an increased number of control messages. In 
[29], the attacker intentionally skips the rank check, which breaks the rank 
rule that keeps the parent-child relationships loop-free. 

RPL has two repair mechanisms in order to keep the DODAG healthy. One 
of them is the global repair operation, where the entire DODAG is re-created. 
According to the RPL specification, only the DODAG root can initiate the 
global repair mechanism by incrementing the Version Number parameter. Every 
DODAG has a corresponding version number that is carried in DIO messages. 
When the root increments the version number, nodes in the RPL network detect 
the global repair operation by checking the version number in the incoming 
DIO messages. They exchange control messages and set up the new DODAG. 
However, there is no mechanism in RPL which guarantees that only the DODAG 
root can change the version number field. Malicious nodes can change the version 
number and force the entire network to set up the DODAG from scratch [11,35]. 
This attack is called the Version Number attack, and it burdens the network 
with unnecessary control messages, delays, packet losses and reduced network 
lifetime. 
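
The lack of origin checking on the version number can be sketched as follows. The class below is a simplified stand-in for a node's DIO handling, not an implementation of RPL: version numbers are compared as plain integers, a rebuild is reduced to forgetting the preferred parent, and all names are assumptions.

```python
class RplNode:
    def __init__(self, version=1):
        self.version = version
        self.parent = "current-parent"   # placeholder for routing state
        self.rebuilds = 0

    def on_dio(self, sender, version):
        """A higher version in any DIO is taken as a global repair signal,
        since the node cannot tell whether the root really issued it."""
        if version > self.version:
            self.version = version
            self.parent = None           # forget routing state and rejoin
            self.rebuilds += 1


node = RplNode()
node.on_dio("root", 1)                   # same version: nothing happens
for forged_version in range(2, 7):
    node.on_dio("attacker", forged_version)
print(node.rebuilds)                     # 5 rebuilds triggered by forged DIOs
```

Note that the `sender` argument is never consulted, which mirrors the problem described above: each forged increment costs the attacker one message but costs the network a full rebuild.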

Similar to global repair, the local repair mechanism of RPL can be the 
target of a D/DoS attack, called the local repair attack [27,29]. Local repair 
is an alternative repair solution of RPL which aims to solve local 
inconsistencies and issues, and it costs less than the global repair mechanism 
since it involves a smaller portion of the network. If nodes detect 
inconsistencies (e.g., loops, packets with wrong direction indicators), they 
start the local repair mechanism, which consists of exchanging control 
messages, re-creating the parent-child relationships and obtaining appropriate 
ranks again. However, malicious nodes can start local repair when there is no 
need, so as to waste resources. 

RPL has a data path validation mechanism, in which headers of the IPv6 data 
packets carry RPL flags that indicate the direction of the packet and possible 
inconsistencies with the rank of the previous sender/forwarder. When a node 
receives a packet with those flags indicating an inconsistency, it drops the packet 
and starts the local repair mechanism. In DODAG inconsistency attacks [49], 
attackers can set the corresponding flags of a data packet before they forward it 
and force the receiver node to drop the packet and start local repair. 

The last D/DoS attack specific to RPL is the DIS attack. DIS messages are 
used in RPL when a new node wants to join the network and therefore asks 
for information about the RPL network. In DIS attacks [27], attackers send 
unnecessary DIS messages, which cause the neighboring nodes to reset their 
DIO timers and send DIO messages frequently. Thus, the attacker forces nodes 
to generate redundant control messages and consume more power. 


Attacks not Specific to RPL. Attacks which are not specific to RPL but are 
still applicable to it are neighbor, sybil, sinkhole, selective forwarding, 
wormhole and cloneID. 

A malicious node can apply the neighbor attack by retransmitting the routing
control messages it hears [27]. This behavior causes neighbor nodes to think that
the source of the control message is close to them and to take actions
accordingly, such as sending control messages back or trying to select that
source as a preferred parent. If the attacker uses a high power radio, then it
may affect a large portion of the network in this way.

In sybil attacks, a malicious node acts as multiple nodes by introducing
itself with multiple logical identities [54]. If there is a voting mechanism running
in the IoT network (e.g., voting based security mechanisms, cluster head
selection), attackers can apply the sybil attack to change the results and thus
take control of the complete network or a portion of it.

The CloneID attack is similar to the sybil attack but works in a different
dimension. In this case, the attacker places clones of a malicious node or of a
normal node at multiple positions in the network [41,54]. This attack has
similar aims to the sybil attack, and it may also be called the node replication
attack.

Sinkhole attacks are another type of attack, in which malicious nodes
advertise good routing parameters to present themselves as candidate parents.
In RPL, attackers can advertise good ranks, which causes the neighbor nodes to
select them as the preferred parent [41,54,55]. Once a malicious node is
selected as the preferred parent by neighbor nodes, it can apply other attacks,
such as selective forwarding.

In selective forwarding attacks, a malicious node inspects the incoming
packets, drops the ones it is interested in and forwards the rest [41,54]. For
example, it may forward only the routing messages and drop the data packets.
Alternatively, a malicious node may filter specific packets sourced from or
destined to specific addresses.

The last attack we explore in this category is the wormhole attack [39,54]. In
wormhole attacks, at least two malicious nodes create a hidden communication
channel by means of multiple radios and transfer the messages overheard at one
end point to the other. This may work bidirectionally as well. In this way, the
two sets of nodes around the attackers believe that they are in the
communication range of each other, which causes several issues.


5.5 D/DoS Attacks to the Transport and Application Layer 


D/DoS attacks which may target the Transport and Application Layers are
flooding, desynchronization, SYN flood, protocol parsing, processing URI,
proxying and caching, risk of amplification, cross-protocol and IP address
spoofing attacks [21,50]. The majority of the attacks mentioned here have not
been studied in the literature, and the IETF considers them possible threats
for CoAP.


6 Studies that Analyze D/DoS Attacks for Internet
of Things


The previous section was about the possible D/DoS attacks which can target 
the IoT networks. Starting from this section, we will analyze the studies for 
the aforementioned attacks. In this section, we will explore the works which 
investigate the effects of the attacks. 

Sokullu et al. [51] proposed GTS attacks against IEEE 802.15.4 in 2008. In their
work, they also analyzed the effects of the attack on the bandwidth utilization
of the Contention Free Period (CFP). They considered single and multiple
attackers, where attackers can attack either randomly or intelligently. They
found a significant decrease in the bandwidth utilization of CFPs due to GTS
attacks.

Le et al. [28] analyzed the rank attack in RPL networks in 2013. In this
work, they applied the rank attack in different cases, where the attacker either
constantly applies the attack or switches frequently between legitimate and
malicious behavior. Their analysis of combinations of these cases shows that
the rank attack is more detrimental if it is applied in a dense part of the
network. They also observed that the number of affected nodes, the number of
generated DIO messages, the average end-to-end delay and the delivery ratio can
serve as indicators of such attacks.

Mayzaud et al. [35] studied RPL version number attacks in 2014. Their
investigation, with a single attacker placed at varying positions in a grid
topology, showed that the location of the attacker is correlated with the
effects of the attack. If the attacker is located far from the DODAG root within
the grid, then its effect is larger than when the attacker is closer to the root.

The version number attack was analyzed in another work [11], by Aris
et al., in 2016. In this study, the authors considered a factory environment
consisting of varying topologies (i.e., grid and random) with different node
mobilities (e.g., static and mobile nodes). A probabilistic attacker model is
incorporated here. Based on the simulations, in addition to the location-effect
correlation found in Mayzaud’s work [35], the authors found that the effect of
mobile attackers can be as detrimental as that of the farthest attacking
position in the network. They also showed that version number attacks can
increase the power consumption of the nodes by more than a factor of two.


Table 2. Categorization of the studies that analyze the D/DoS attacks for IoT

Proposal      | Target attack            | Finding
Sokullu [51]  | GTS (MAC Layer)          | Significant bandwidth utilization decrease in CFP
Le [28]       | Rank (Routing)           | Dense networks are more vulnerable
Mayzaud [35]  | Version Number (Routing) | Attacking position-effect of the attack correlation
Aris [11]     | Version Number (Routing) | Mobile attackers are more detrimental and attack triples the power consumption of the network

Table 2 categorizes the studies which analyze the effects of the D/DoS attacks
for IoT. When we review the studies in this section, we see that researchers
focused on the IoT-specific attacks rather than the attacks which we are already
familiar with from Wireless Sensor Networks research (e.g., selective
forwarding, wormhole, sinkhole). In addition, three of the studies found
correlations between the success of the attack and the attack settings. Such
findings can be extremely useful in defending IoT networks against attackers and
in designing better detection and mitigation systems. In Table 1, we provided a
categorization of the D/DoS attacks for IoT, and it is clear that many of those
attacks have not yet been implemented and analyzed in a similar manner.


7 Mitigation Systems and Protocol Security Solutions 
for Internet of Things 


Mitigation systems are proposed by researchers in order to minimize the effects
of the attacks. Such systems are far from being a complete security solution, but
they can still increase the strength of the system against attackers. In this
context, designers enrich existing protocols with additional features which
can mitigate the detrimental effects of the D/DoS attacks. On the other hand,
the protocol security solutions referred to here consist of mechanisms which aim
to secure a communication protocol or a specific part of it. In this section,
readers can find the studies which either propose a security solution or
mitigate the effects of the attacks.

In 2011, Dvir et al. proposed VeRA [16], a security solution for the crucial
version number and rank parameters carried in DIO messages. Their solution
makes use of hash chains and message authentication codes in order to securely
exchange these RPL parameters in DIO messages.
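
The general idea of protecting a monotonically increasing value with a hash chain can be sketched as below. This is a generic one-way hash chain and not VeRA's exact construction, which also involves message authentication codes; the function names, the use of SHA-256 and the chain length are assumptions made for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    """One-way function used to build the chain (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def build_chain(seed: bytes, length: int) -> list:
    """Return [seed, h(seed), h(h(seed)), ...]; the last element is the public anchor."""
    chain = [seed]
    for _ in range(length):
        chain.append(h(chain[-1]))
    return chain


def verify(revealed: bytes, last_accepted: bytes, max_steps: int) -> bool:
    """Accept `revealed` only if hashing it at most max_steps times gives last_accepted."""
    value = revealed
    for _ in range(max_steps):
        value = h(value)
        if value == last_accepted:
            return True
    return False


# The root keeps the chain secret and publishes only the anchor. For every new
# version it reveals the preceding chain element, which nodes can check.
chain = build_chain(b"root-secret", 5)
anchor = chain[5]
print(verify(chain[4], anchor, max_steps=1))   # element revealed by the root
print(verify(b"forged", anchor, max_steps=5))  # value made up by an attacker
```

Because the hash function is hard to invert, a node that only knows the anchor cannot produce the next valid element itself, which is what prevents an attacker from announcing a newer version on behalf of the root.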

Weekly et al. [55] evaluated defense techniques against sinkhole attacks in
RPL in 2012. They compared a reduced implementation of VeRA to their novel
technique called Parent Failover. Parent Failover uses an Unheard Node Set,
which includes the IDs of the nodes that the BR did not hear from. In this
technique, each node blacklists its parent if it sees itself in the list.
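
The blacklisting rule can be sketched in a few lines. The function and variable names are hypothetical, and the selection of a replacement parent by lowest rank is our own assumption added to make the example complete; it is not described in this chapter.

```python
def apply_parent_failover(node_id, parent_id, unheard_nodes, blacklist):
    """Blacklist the current parent if the BR reports that it has not heard from us."""
    if node_id in unheard_nodes:
        blacklist.add(parent_id)
    return blacklist


def choose_parent(candidates, blacklist):
    """Pick the lowest-rank candidate that is not blacklisted (None if all are)."""
    allowed = {node: rank for node, rank in candidates.items() if node not in blacklist}
    return min(allowed, key=allowed.get) if allowed else None


# Node n7 uses n3 as parent, but the BR lists n7 in the Unheard Node Set,
# which suggests that n3 does not forward n7's packets.
blacklist = apply_parent_failover("n7", "n3", unheard_nodes={"n7", "n9"}, blacklist=set())
print(choose_parent({"n3": 256, "n4": 512}, blacklist))
```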

Wallgren et al. [54] proposed implementations of routing attacks (i.e.,
selective forwarding, sinkhole, hello flood, wormhole, sybil) which are not
specific to RPL. They did not analyze the effects of the attacks. However, they
commented on possible mitigation/detection mechanisms against such attacks.
Their mitigation ideas include the usage of geographical location information,
the incorporation of cryptography schemes, the use of multiple routes and/or RPL
instances, and keeping track of the number of nodes within the network. Although
the authors suggest using such mechanisms against the corresponding attacks,
they did not implement the mitigation mechanisms or analyze their performance.

Hummen et al. [19] proposed two novel attacks on the 6LoWPAN adaptation
layer, namely the fragment duplication attack and the buffer reservation attack.
They also proposed two novel mitigation mechanisms against these attacks. For
fragment duplication attacks, the authors proposed hash chain structures which
bind the fragments of a packet to the first fragment of the corresponding
packet. In order to mitigate the effects of buffer reservation attacks, they
suggested splitting the reassembly buffer into fragment-sized slots and letting
multiple fragments belonging to different packets use it. They combined the
split buffer approach with a fragment discard mechanism for overloaded buffer
conditions.

In 2014, Sehgal et al. [49] proposed a mitigation study which targets DODAG 
inconsistency attacks. According to the authors, RPL uses a threshold to miti- 
gate the effects of such an attack. In RPL, a node receiving a data packet with 
flags indicating an inconsistency drops the packet and resets its trickle timer. 
A node can do this until reaching a threshold. After this threshold it does not 
reset the trickle timer any more. This proposal changes the constant threshold 
of RPL to an adaptive threshold to mitigate the effects of the attack better. 
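
The constant-threshold behavior described above can be sketched as follows. The class and method names are hypothetical, and the notion of a counting period is an assumption. The adaptation rule of [49] is not given in this chapter, so the sketch only offers a hook through which a different threshold can be supplied.

```python
class InconsistencyLimiter:
    """Limits the trickle timer resets triggered by inconsistent data packets."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.resets = 0

    def on_inconsistent_packet(self):
        """Return True if the trickle timer is reset for this packet.

        The packet itself is dropped in either case.
        """
        if self.resets < self.threshold:
            self.resets += 1
            return True
        return False

    def new_period(self, threshold=None):
        """Start a new counting period, optionally with an adapted threshold."""
        self.resets = 0
        if threshold is not None:
            self.threshold = threshold


limiter = InconsistencyLimiter(threshold=3)
print([limiter.on_inconsistent_packet() for _ in range(5)])
```

With a constant threshold the attacker can still cause `threshold` resets in every period; an adaptive scheme would call `new_period` with a smaller value while an attack is suspected.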

Another mitigation technique for DODAG inconsistency attacks was proposed
by Mayzaud et al. [33] in 2015. It is an improved version of the mitigation
technique proposed in Sehgal’s work [49]. In the former study, packets with the
‘R’ flag set were counted, whereas in this study, the number of trickle timer
resets is counted. Based on this, a node either drops the packets and resets the
trickle timer, or forwards the packets after restoring the R and O flags to
their normal state.


Table 3. Categorization of the mitigation systems and protocol security solutions

Proposal      | Target attack                                     | Mitigation/Security mechanism
VeRA [16]     | Rank and Version Number (Routing)                 | Hash chains and Message Authentication Codes
Weekly [55]   | Sinkhole (Routing)                                | Reduced VeRA and Unheard Node Set
Wallgren [54] | Routing attacks not specific to RPL               | Geographical Location Info., Cryptography, Multiple Paths and Instances, Cardinality of the Network
Hummen [19]   | Fragment Duplication and Buffer Reservation (MAC) | Content Chaining Using Hash Chains, Split Buffer with Fragment Discard
Sehgal [49]   | DODAG Inconsistency (Routing)                     | Adaptive Threshold for Inconsistency Situations
Mayzaud [33]  | DODAG Inconsistency (Routing)                     | Adaptive Threshold for Inconsistency Situations
Ramani [43]   | CloneID (Routing), General DoS                    | Distributed Firewall


In 2016, Ramani proposed a two-way firewall [43] for low power and lossy 
networks. The two-way firewall analyzes the traffic destined to the 6LoWPAN 
network and traffic leaving from the network. The proposed firewall was tested 
against the CloneID and simple DoS attacks. The main module of the firewall 
works on the BR and becomes active when packets destined to the CoAP and 
DTLS ports are captured. Packets are parsed into incoming and outgoing pack- 
ets and their IP addresses and ports are verified. After this check, information 
related to the packet is saved and checked against the protocol rules. Erroneous 
packets are dropped here. Also the nodes in the 6LoWPAN network are equipped 
with mini-firewall modules which inform the main firewall about their behavior. 

Table 3 categorizes the Mitigation Systems and Protocol Security Solutions 
for IoT. Mitigation mechanisms against routing attacks constitute the majority 
of the studies in this section. Researchers targeted both RPL-specific attacks 
and other routing attacks which can be applied to RPL as well. Considering the 
resource limitations in IoT networks, we can see that three proposals use hash 
functions as lightweight solutions. 


8 Intrusion Detection Systems for Internet of Things 


In this section, we will survey the literature for intrusion detection systems 
proposed against D/DoS attacks for IoT networks. This section is organized as 
follows: Firstly, we will briefly give some background information about Intrusion 
Detection Systems (IDS). After that, we will analyze the IDSes proposed for IoT. 


8.1 Intrusion Detection Systems 


Intrusion Detection Systems serve as a strong line of defense for computer net- 
works against the attackers. Without IDS, the puzzle of a secure network is 
incomplete. As explained in Sect. 4.3, despite having cryptography-based solu- 
tions, attacks are still possible and IDS comes into the picture here, where it 
monitors and analyzes the traffic, data, behavior or resources and tries to pro- 
tect the network from attackers. 

Intrusion Detection Systems can be explored from various points of view.
Figure 5 shows a 3D Cartesian space of IDSes, where the axes depict important
categories which may be helpful to classify the IDSes. Although not shown,
there may be other dimensions in this figure, such as operation frequency and
targeted attacks.

Fig. 5. Intrusion detection systems


Intrusion Detection Systems: Detection Techniques. IDSes can be
divided into four classes based on the detection technique. These are
anomaly-based, signature-based, specification/rule-based and hybrid systems,
where the former two are the most popular ones.

Anomaly-based systems learn the behavior of the system when there is no
attack and create a profile. Deviations from the profile show possible anomalies.
Anomaly-based IDSes can detect new attacks, since attacks are expected
to cause deviations from the ordinary behavior. However, they can create false
alarms and incorrectly classify legitimate connections as intrusion attempts. In
addition, anomaly-based techniques are generally believed to be more
complex and to use more resources than the other detection techniques.
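
A very small instance of the profile-and-deviation idea is sketched below. It assumes that the profile is simply the mean and standard deviation of one traffic feature measured while no attack is present, and that a deviation of more than `k` standard deviations is reported; the feature, the sample values and `k` are all hypothetical.

```python
from statistics import mean, stdev


def build_profile(samples):
    """Learn the profile (mean, standard deviation) from attack-free measurements."""
    return mean(samples), stdev(samples)


def is_anomalous(value, profile, k=3.0):
    """Flag a measurement that deviates more than k standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma


# Packets per minute observed at a node during normal operation (made-up values).
profile = build_profile([10, 12, 11, 9, 10, 11, 12, 10])
print(is_anomalous(11, profile))  # ordinary load
print(is_anomalous(40, profile))  # burst, e.g., during a flooding attack
```

The sketch also hints at the false alarm problem: a legitimate but unusual burst would be flagged in exactly the same way as an attack.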

Signature-based systems aim to detect intrusions by making use of attack 
signatures/patterns. Typically signatures are stored in a database and IDS tries 
to match them when analyzing the connections, packets or resources. If the 
database does not have a signature for an attack, which happens in case of new 
attacks, such systems cannot detect it. Otherwise they promise high detection 
rates for the known attacks and they do not suffer from false alarms. If we use 
signature-based techniques, we have to consider how to deal with new attacks 
since our IDS will probably skip them. Also we have to think about the storage 
cost of the signatures. 
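
Signature matching can be illustrated with a small rule database. The signatures below, their names and their thresholds are invented for this example and do not come from any real IDS rule set.

```python
def match_signatures(event, signatures):
    """Return the names of all signatures whose predicate matches the event."""
    return [name for name, predicate in signatures.items() if predicate(event)]


# Hypothetical signature database: each entry is a predicate over an event record.
signatures = {
    "udp_flood": lambda e: e.get("proto") == "UDP" and e.get("pkts_per_s", 0) > 100,
    "dis_flood": lambda e: e.get("rpl_msg") == "DIS" and e.get("pkts_per_s", 0) > 10,
}

print(match_signatures({"proto": "UDP", "pkts_per_s": 500}, signatures))
# An attack without a stored signature goes unnoticed:
print(match_signatures({"rpl_msg": "DIO", "version_jump": True}, signatures))
```

The second call shows the limitation discussed above: an attack for which no signature is stored produces no match at all.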

Specification/rule based systems require specifications of the proto- 
cols/systems and create rules based on the specifications. These rules separate 
legitimate connections from the malicious ones. In such systems creation of the 
specification and coverage of the created rules are important issues which affect 
the performance of the IDS. 

Hybrid intrusion detection systems consider the advantages and disadvantages
of the previous three detection techniques and aim to benefit from several of
them at the same time. Of course, such a decision may be costly in terms of the
available resources.


Intrusion Detection Systems: Detection Resources. Intrusion Detection
Systems can be divided into three categories in terms of the resources they use
for detection. These are network-based, host-based and hybrid detection systems.

Network-based IDSes monitor the traffic coming into and going out of the
network, in addition to the internal traffic, to detect intrusions.
Network-based IDSes can have a global view of the network and use it to boost
the detection performance. However, such systems lack information about
the individual resource consumption and logs of the nodes within the network,
which may be crucial for the detection of specific attacks.

Host-based intrusion detection systems consider the traffic only coming to 
and leaving from the host. Such systems monitor the resources and logs of the 
hosts as well which may provide hints about attacks. Since they work locally, they 
cannot have a global knowledge about the state of other nodes or the network 
which can be very useful to increase the performance of the IDS. 

Hybrid IDSes combine the strengths of network-based and host-based systems
and thus benefit from both network and node resources.


Intrusion Detection Systems: Detection Architecture. Architecture of 
Intrusion Detection Systems can be centralized, distributed or hybrid. 

Centralized IDSes place the intrusion detection to a central location and all 
of the monitoring information has to be collected here. One of the main reasons 
to select a central point for intrusion detection can be the available resources. As 
mentioned previously, anomaly detection techniques can be resource-hungry and 
it may not be feasible to place them on every node due to resource-constraints. 
Therefore, a resource-rich node, such as border router, can accommodate the 
intrusion detection system. However, centralized systems come with
communication overhead, since monitoring data has to be carried all the way to
the central location. If malicious nodes prevent monitoring data from reaching
the centralized IDS, then they may succeed in misleading the IDS.

In Distributed IDSes, intrusion detection runs locally at every node in the 
network. In order to afford an IDS at every node, designers have to tailor the 
detection technique or algorithm according to the available resources. This
approach clearly does not have any communication overhead. However, the IDS has
only local information to analyze in order to detect the intrusions.

Hybrid IDSes again combine both of the detection architectures and try to
benefit from each of them as much as possible. In such systems, the IDS is
divided into modules, and these modules are distributed across the network.
Modules at every node can apply intrusion detection to a certain extent and
share less information with the centralized module than in centralized IDSes.
Hybrid IDSes thus both reduce the communication overhead and enjoy the rich
resources of the centralized module.


8.2 Intrusion Detection Systems Proposed for Internet of Things 


Cho et al. [15] proposed a botnet detection mechanism for 6LoWPAN-based
networks in 2009. They assumed that the nodes in the IoT network use the TCP
transport layer protocol. Nearly seven years before the Mirai botnet, this study
considered how IoT networks can be used as a botnet for DDoS attacks.
The authors argued that, if there exists a malicious node on the forwarding
path, then it can forge packets and direct them to the address of the target
victim based on the command of the bot master. A detection mechanism is placed
at the 6LoWPAN gateway node, which analyzes the TCP control fields, average
packet lengths and number of connections to detect botnets. The idea is
based on the hypothesis that ordinary IoT traffic should be very homogeneous
and that a botnet would cause significant deviations in the traffic.

Le et al. proposed a specification-based IDS [29] for IoT in 2011. It targets
rank and local repair attacks. Their work assumes that a monitoring network is
set up at the start of the network with a minimum number of trusted Monitoring
Nodes (MNs), which have full coverage of the RPL network and are capable of
performing additional monitoring jobs. In this context, it has a distributed
architecture and it is a network-based IDS. Each MN stores the IDs, ranks
and preferred parents of neighboring nodes. MNs accommodate a Finite State
Machine (FSM) of RPL with normal and anomaly states to detect the attacks.
If an MN cannot decide whether a node is an attacker or not, then it can ask the
other MNs. This IDS was not implemented, and the authors did not specify the
format of the communication between the monitoring nodes.

Misra et al. [36] proposed a learning automata based IDS for DDoS attacks 
in IoT. When there is an attack taking place, packets belonging to the mali- 
cious entities need to be sampled and dropped. This study aims to optimize this 
sampling rate by means of Learning Automata (LA). Firstly DDoS attacks are 
detected at each IoT node in the network based on the serving capacity thresh- 
olds. When the source of the attack is identified, all of the nodes are informed 
about it. In the next step, each node samples the attack packets and drops them. 
This is where the LA solution comes into play: the sampling rate of the attack
packets is optimized by means of the LA.

The SVELTE IDS [41] was proposed by Raza et al. in 2013. It targets sinkhole
and selective forwarding attacks. It has a hybrid architecture, in which it
places lightweight IDS modules (i.e., the 6LoWPAN mapper client and the mini
firewall module) at the resource-constrained devices and the main IDS (i.e., the
6LoWPAN mapper, the intrusion detection module and the distributed mini
firewall) at the resource-rich Border Router (BR). The 6LoWPAN Mapper at the BR
periodically sends requests to the mapper clients at the nodes. Mapper clients
reply with their ID, their rank, their parent ID, and the IDs and ranks of their
neighbors. Based on the collected information about the RPL DODAG, SVELTE
compares ranks to find inconsistencies. It also compares the elements of the
white-list with the elements of the current RPL DODAG and uses the nodes’
message transmission times to find the filtered nodes. The mini firewall module
is used to filter outsider attackers. In addition, nodes change their parents
with respect to the packet losses encountered. In terms of the detection
resources used, we can classify SVELTE as a network-based IDS. We can also
put it into the category of specification/rule-based IDSes.
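
A rank comparison of the kind described above could look like the following sketch. It is not SVELTE's actual algorithm, which is not detailed in this chapter; the report format and the two rules (a node must rank higher than its parent, and a node's self-reported rank must match what its neighbors heard) are assumptions made for illustration.

```python
def rank_inconsistencies(reports):
    """Return the nodes whose rank information is inconsistent.

    reports maps node_id -> {"rank": int, "parent": id or None,
                             "neighbors": {neighbor_id: rank heard from it}}
    """
    suspects = set()
    for node, info in reports.items():
        parent = info["parent"]
        # Rule 1: a node must have a higher rank than its preferred parent.
        if parent in reports and info["rank"] <= reports[parent]["rank"]:
            suspects.add(node)
        # Rule 2: the rank heard by a neighbor must match the self-reported rank.
        for neighbor, heard_rank in info["neighbors"].items():
            if neighbor in reports and reports[neighbor]["rank"] != heard_rank:
                suspects.add(neighbor)
    return suspects


# Node m advertises rank 300 to its neighbors (to attract traffic) but
# reports rank 768 to the mapper at the BR.
reports = {
    "root": {"rank": 256, "parent": None, "neighbors": {"a": 512, "m": 300}},
    "a": {"rank": 512, "parent": "root", "neighbors": {"root": 256, "m": 300}},
    "m": {"rank": 768, "parent": "a", "neighbors": {"a": 512}},
}
print(rank_inconsistencies(reports))
```

A real deployment would have to tolerate transient mismatches caused by lossy links and rank updates in flight, for example by requiring several inconsistent reports before raising an alarm.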

In the same year as SVELTE, Kasinathan et al. proposed a centralized
and network-based IDS [21] and its demo [20] for 6LoWPAN-based IoT networks.
The motivation of this study is the drawback of centralized IDSes, which
suffer from internal attackers. As mentioned in Sect. 8.1, internal attackers may
prevent monitoring data from reaching the centralized IDS. This is due to the
fact that monitoring data is sent through the shared wireless medium, which
attackers can interfere with. In this IDS architecture, the authors place
monitoring probes in the IoT network which have wired connections to the
centralized module. Their architecture was evaluated via a very simple scenario
in which they used an open source signature-based IDS and a monitoring probe
sends monitoring data under a UDP flood attack.

Amaral et al. [6] proposed a network-based IDS for IPv6-enabled WSNs. In
the proposed scheme, watchdogs which employ a network-based IDS are deployed
at specific positions within the network. These nodes listen to their neighbors
and monitor the exchanged packets. The IDS module at each watchdog uses
rules to detect intrusion attempts. These rules are transmitted to the watchdogs
through a dedicated channel. In order to dynamically configure the watchdogs,
the authors used a policy programming approach.

In 2015, Pongle et al. proposed an IDS [39] for wormhole attacks. Their IDS
has a hybrid architecture similar to SVELTE. The main IDS is located at the BR
and lightweight modules are located at the nodes. This study assumes that the
nodes are static and that the location of each node is known by the BR at the
beginning. The main IDS collects neighbor information from the nodes and uses it
to find the suspected nodes, whose distance is found to be more than the
transmission range of a node. The probable attacker is then detected by the IDS
based on the collected Received Signal Strength Indicator (RSSI) measurements
related to the suspected nodes. In terms of the detection resource, we can
consider this study a network-based IDS. From the detection technique point of
view, it can be counted as a specification/rule-based IDS.
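
The distance check at the BR can be sketched as follows, assuming two-dimensional node positions and a single transmission range for all nodes. The function name and the data layout are hypothetical, and the subsequent RSSI-based step that singles out the attacker is not modelled.

```python
from math import dist


def suspected_links(positions, neighbor_lists, tx_range):
    """Return reported neighbor pairs that are farther apart than the radio range."""
    suspects = set()
    for node, neighbors in neighbor_lists.items():
        for other in neighbors:
            if dist(positions[node], positions[other]) > tx_range:
                suspects.add(frozenset((node, other)))
    return suspects


# Known static positions (in meters) and the neighbor lists reported to the BR.
positions = {"a": (0, 0), "b": (10, 0), "c": (200, 0)}
neighbor_lists = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
print(suspected_links(positions, neighbor_lists, tx_range=50))
```

Here nodes a and c report each other as neighbors although they are 200 m apart, which is only plausible if some hidden channel relays their messages.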

Another IDS proposed in 2015 was INTI [14] which targets sinkhole attacks. 
INTI consists of four modules. The first module is responsible for the cluster for- 
mation, which converts the RPL network to a cluster-based network. The second 
module monitors the routing operations. The third module is the attacker detec- 
tion module, where reputation and trust parameters are determined by means 
of Beta distribution. Each node sends its status information to its leader node, 
which in turn determines the trust and reputation values. Threshold values on 
these parameters define whether a node is an attacker or not. The fourth module 
isolates the attacker by broadcasting its information. INTI can be classified as an 
anomaly-based IDS with distributed IDS architecture. In terms of the detection 
resources, we can put it into the category of hybrid IDSes. 
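
The chapter does not give INTI's exact formulas. As a rough illustration of how a Beta distribution can turn observations into a trust value, the sketch below uses the mean of a Beta(good + 1, bad + 1) distribution, a common choice in Beta reputation systems; the function names, the observation counts and the threshold of 0.5 are assumptions.

```python
def beta_trust(good, bad):
    """Mean of a Beta(good + 1, bad + 1) distribution over a node's behavior."""
    return (good + 1) / (good + bad + 2)


def is_attacker(good, bad, threshold=0.5):
    """Classify a node as an attacker if its trust value falls below the threshold."""
    return beta_trust(good, bad) < threshold


# A leader node counts correctly forwarded (good) and missing (bad) packets.
print(beta_trust(0, 0))   # no evidence yet: neutral value
print(beta_trust(8, 0))   # consistently good behavior
print(is_attacker(1, 9))  # mostly dropped packets
```

A node without any history starts at the neutral value of 0.5, and every additional observation moves the estimate toward the observed ratio of good behavior.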

Sedjelmaci et al. proposed an anomaly-detection technique [48] for low power
and lossy networks in 2016. Unlike the other IDSes, which target specific attacks
and aim to detect them, this study focuses on the optimization of the running
time of detection systems. The motivation of the study is derived from the
fact that anomaly-based systems require more resources than signature-based
systems. If a system can afford to be hybrid, having both of the detection
systems, then the running time of the anomaly-detection module has to be
optimized in order to lengthen the lifetime of the network. The authors
chose game theory for the optimization of the running time of the
anomaly detection system. They claim that, thanks to game theory, anomaly
detection runs only when a new attack is expected to occur. During such time
intervals, it creates attack signatures. The signature-based system in turn
adds these signatures to its database and runs more often than the
anomaly-detection system.

Mayzaud et al. proposed a detection system [32] for version number attacks
in 2016. Their system uses the monitoring architecture which was proposed in
the authors’ earlier work [34]. This monitoring system makes use of the multiple
instance support of the RPL protocol. It consists of special monitoring nodes
with long range communication radios. These nodes are assumed to cover the
whole network and can send the monitoring information to the DODAG root
using a second RPL instance that is set up as the monitoring network. In the
proposed IDS, monitoring nodes eavesdrop on the communication around them and
send to the root the addresses of their neighbors and the addresses of the nodes
that send DIO messages with incremented version numbers. The root detects the
malicious nodes by means of the collected monitoring information. However, the
proposed technique suffers from a high false positive rate. This IDS can be
counted as a network-based IDS with a centralized detection architecture. We can
also categorize it as a specification/rule-based IDS.

Another IDS proposed in 2016 was Saeed et al.’s work [45]. This study focuses
on a smart building/home environment where readings of sensors are sent to the
server via a base station, and in particular on the attacks that target the
base station. These attacks include software-based attacks and other attacks
(i.e., performance degradation attacks, attacks on the integrity of the data).
The anomaly-based IDS, which has a centralized architecture, is located at the
base station. It consists of two layers. The first layer is responsible for
analyzing the behavior of the system and detecting anomalies. It uses
Random Neural Networks to create the profile and detect the anomalies. The
second layer is responsible for the software-based attacks. It applies a
tagging mechanism to pointer variables, so that accesses through a pointer are
limited to the boundaries given by its tag.

Le et al. proposed a specification-based IDS [27] which is based on their
previous work [29]. Their IDS targets rank, sinkhole, local repair, neighbor and
DIS attacks. First, the proposal obtains an RPL specification by analyzing the
trace files of extensive simulations of RPL networks without any attacker. From
the analysis of the traces of each node, the states, transitions and statistics
of each state are obtained. These are merged to obtain a final FSM of RPL, which
helps the authors to find instability states and the required statistics. This
study organizes the network in a clustered fashion. The IDS is placed at each
Cluster Head (CH). The CH sends requests to cluster members periodically.
Members reply with their neighbor lists, rank and parent information. For each
member, the CH stores RPL related information. The CH runs five mechanisms
within the IDS: detection of illegitimate DIS messages, and checks for fake
DIO messages, rank inconsistencies, rank rules, and instability of the network.
The CH makes use of three thresholds to find the attackers. These are the number
of DIS state visits, the number of instability state visits, and the number of
faults. This IDS can be counted as a network-based IDS with a distributed
architecture.

Aris and Oktug proposed a novel IDS design [9] in 2017, which is an
anomaly-based IDS with a hybrid architecture. In this study, lightweight
monitoring modules are placed at each IoT node and the main IDS is placed at the
BR. Monitoring modules send RPL-related information and resource information of
the node. The main IDS module periodically collects the monitoring information
and also works as a firewall, where it can analyze the incoming and outgoing
traffic from and to the Internet. In this study, the IDS modules working on
different RPL networks can share suspicious events information with each other.
Each IDS works autonomously and detects anomalies using the monitoring
information of the 6LoWPAN network, firewall information and suspicious events
information. When anomalies are detected, nodes within the network are informed
via white-lists, whereas other IDSes are informed via the exported suspicious
events information. This anomaly-based IDS is hybrid in terms of both the
architecture and the detection resources used.

Table 4. Categorization of intrusion detection systems for IoT


Proposal          Det. arch.    Det. technique             Det. resource
Cho [15]          Centralized   Anomaly-based              Network-based
Le [29]           Distributed   Specification/Rule-based   Network-based
Misra [36]        Distributed   Specification/Rule-based   Network-based
SVELTE [41]       Hybrid        Specification/Rule-based   Network-based
Kasinathan [21]   Centralized   Signature-based            Network-based
Amaral [6]        Distributed   Specification/Rule-based   Network-based
Pongle [39]       Hybrid        Specification/Rule-based   Network-based
INTI [14]         Distributed   Anomaly-based              Hybrid
Sedjelmaci [48]   Distributed   Hybrid                     Host-based
Mayzaud [32]      Centralized   Specification/Rule-based   Network-based
Saeed [45]        Centralized   Anomaly-based              Network-based
Le [27]           Distributed   Specification/Rule-based   Network-based
Aris [9]          Hybrid        Anomaly-based              Hybrid


Table 5. Target attacks and implementation environments of intrusion detection
systems for IoT


Proposal          Target attacks                               Implem. env.
Cho [15]          Botnets                                      Custom Simulation
Le [29]           Rank, Local Repair                           Not implemented
Misra [36]        General DoS                                  Custom Simulation
SVELTE [41]       Sinkhole, Selective-forwarding               Contiki Cooja
Kasinathan [21]   UDP flooding                                 Project Testbed
Amaral [6]        General DoS                                  Project Testbed
Pongle [39]       Wormhole                                     Contiki Cooja
INTI [14]         Sinkhole                                     Contiki Cooja
Sedjelmaci [48]   General DoS                                  TinyOS TOSSIM
Mayzaud [32]      Version Number                               Contiki Cooja
Saeed [45]        Software-based attacks, Integrity attacks,   Prototype impl.
                  Flooding and other
Le [27]           Rank, Sinkhole, Local Repair, Neighbor,      Contiki Cooja
                  DIS


Table 4 categorizes the IDSes for IoT. The table shows that the majority of the
systems are specification/rule-based. It clearly shows that researchers have
focused on the protocols (i.e., RPL) rather than on the common approach of
creating a profile of normal behavior. This observation is also related to
signature-based systems being rarely proposed for IoT, which may be due to the
difficulty of creating signatures for the aforementioned attacks in IoT
environments. In terms of the detection architecture, we can see that
researchers consider every possible architecture and that no option clearly
outperforms the others. When we analyze the detection resources used, most of
the studies are network-based IDSes. This shows that node resources and logs
are not yet used frequently by IoT security researchers. This may be due to the
already limited resources of the nodes, which may already be fully used (e.g.,
RAM), or to the lack of space to store logs. It is interesting, however, that
no proposal considers using deviations in power consumption as an indicator of
an intrusion attempt.
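
The observations on Table 4 can be checked with a simple tally. The following
minimal sketch transcribes the rows of Table 4 and counts the categories per
column; the names TABLE_4 and tally are illustrative only, and the spelling of
the categories has been normalized.

```python
from collections import Counter

# Rows transcribed from Table 4: (proposal, architecture, technique, resource).
TABLE_4 = [
    ("Cho [15]", "Centralized", "Anomaly-based", "Network-based"),
    ("Le [29]", "Distributed", "Specification/Rule-based", "Network-based"),
    ("Misra [36]", "Distributed", "Specification/Rule-based", "Network-based"),
    ("SVELTE [41]", "Hybrid", "Specification/Rule-based", "Network-based"),
    ("Kasinathan [21]", "Centralized", "Signature-based", "Network-based"),
    ("Amaral [6]", "Distributed", "Specification/Rule-based", "Network-based"),
    ("Pongle [39]", "Hybrid", "Specification/Rule-based", "Network-based"),
    ("INTI [14]", "Distributed", "Anomaly-based", "Hybrid"),
    ("Sedjelmaci [48]", "Distributed", "Hybrid", "Host-based"),
    ("Mayzaud [32]", "Centralized", "Specification/Rule-based", "Network-based"),
    ("Saeed [45]", "Centralized", "Anomaly-based", "Network-based"),
    ("Le [27]", "Distributed", "Specification/Rule-based", "Network-based"),
    ("Aris [9]", "Hybrid", "Anomaly-based", "Hybrid"),
]


def tally(rows, column):
    """Count how often each category appears in the given column index."""
    return Counter(row[column] for row in rows)


architecture = tally(TABLE_4, 1)
technique = tally(TABLE_4, 2)
resource = tally(TABLE_4, 3)

for name, counts in (("architecture", architecture),
                     ("technique", technique),
                     ("resource", resource)):
    print(name, counts.most_common())
```

With these rows, 7 of the 13 proposals are specification/rule-based and 10 of
the 13 are network-based, which matches the statements above.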

Table 5 compares the studies in terms of target attacks and implementation
environments. As shown in the table, the majority of the attacks targeted by
IDSes for IoT are routing attacks. A large portion of the studies focus on a
single attack, whereas only a few consider multiple attacks. Nearly all of
these attacks are insider attacks. This means that IoT security researchers in
this context are not yet considering threats originating from the Internet. In
addition, only one study targets software-based attacks. However, embedded
system developers commonly program in the C language, which may expose
software-based vulnerabilities to attackers targeting IoT. In terms of the
implementation environment, Contiki Cooja is the environment selected most
often by the researchers.


9 Discussion and Open Issues 


In this study, we provided an extensive overview of Internet of Things security
in order to ensure a reliable Internet of Services for the future. Of course,
there may be other studies that were unintentionally left unmentioned.
Considering the limitations, attacks, cryptography-based security solutions and
studies in the literature, there are still several issues to research in order
to reach a secure IoT environment.

One of the major points to consider is usability and user experience when
providing security to IoT environments. We have to consider users and provide
user-friendly schemes that do not reduce user satisfaction while promising
security. This is directly related to the success and acceptance of our
solutions. Otherwise, our efforts will be in vain, making the attackers' job
easier.

As we have mentioned in Sect. 5.2, some of the MAC layer attacks make use of
jamming to achieve their aim. If we find a solution against jamming attacks,
this may make it easier to mitigate the effects of such attacks.

IDSes proposed for IoT use thresholds to decide whether a node/connection is
malicious or not. However, in the proposals considered, thresholds seem to be
set intuitively rather than by means of a scientific technique. This approach
clearly limits the applicability and reproducibility of the proposed
mechanisms. The way we set thresholds may be an important issue to think about
when designing IDSes.
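
As an illustration of what a data-driven alternative could look like, the
following minimal sketch derives a threshold from a percentile of observed
per-node counts instead of picking a value by hand. It is not taken from any of
the surveyed proposals: the function names, the sample data and the choice of
the 95th percentile are assumptions made for the example, and it presumes that
attack-free baseline observations are available, which may not hold in
practice.

```python
import statistics


def percentile_threshold(baseline, pct=95):
    """Return the pct-th percentile of baseline observations.

    baseline: counts (e.g., DIS messages per minute) that are assumed to have
    been collected while no attack was taking place.
    """
    # quantiles() with n=100 returns the 99 percentile cut points.
    cuts = statistics.quantiles(baseline, n=100, method="inclusive")
    return cuts[pct - 1]


def is_suspicious(value, threshold):
    """Flag an observation that exceeds the learned threshold."""
    return value > threshold


# Hypothetical baseline: DIS messages per minute observed from one neighbor.
baseline_dis_per_minute = [0, 1, 0, 2, 1, 0, 1, 3, 0, 1, 2, 1]
threshold = percentile_threshold(baseline_dis_per_minute, pct=95)

print("threshold:", threshold)
print("10 DIS/min suspicious?", is_suspicious(10, threshold))
print("1 DIS/min suspicious?", is_suspicious(1, threshold))
```

Reporting how the threshold was derived (data set, percentile, observation
window) is what would make such a choice reproducible by other researchers.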

The assumptions of the studies are another point to reconsider. Some studies
assume that a monitoring network covering the whole network with a minimum
number of nodes was set up at the beginning and is ready to use. Such
assumptions have to be supported with deployment scenarios; otherwise, they may
not be realistic.

Anomaly-based and also specification-based IDSes typically require an
attack-free period in which the underlying system is able to learn the normal
operating conditions and create a profile accordingly. However, this may not be
possible for real-life deployments. In addition, if our deployment includes
thousands of nodes, then ensuring such a period may not even be feasible.

Most of the studies target only a small number of attacks as mentioned in 
another study [60]. Researchers have to target a broader range of attacks or pro- 
pose systems which have the capability to be extended to detect other attacks too. 

In terms of the types of attacks, most of the studies focus only on insider
attacks, whereas outsider attacks from the Internet have to be researched and
analyzed as well. When we consider Table 1, we can see that attacks above the
network layer were not studied extensively. This clearly shows that the
transport and application layers of IoT may be vulnerable to attacks and that
IoT is likely to feature prominently in news of DDoS attacks.

Another issue with IoT security research is related to the reproducibility and
comparability of the studies. When we look at the studies, most of the authors
keep the source code of their implementations closed. In addition, IoT security
research does not have datasets that can be used by researchers as a common
performance evaluation benchmark, although publicly available testbeds exist.
It would enrich IoT security research if more researchers shared their
implementations with the public and organizations provided datasets that can be
used for evaluation purposes.


10 Conclusion 


In this study, we aimed to provide a comprehensive overview of the security of
IoT for a reliable IoS. To this end, we incorporated several points of view:
the unique characteristics of IoT environments and how they affect security;
the architectural components of IoT and their relation to the standardized
protocol stack; cryptography-based solutions and their detailed comparison,
together with considerations on related issues (i.e., implementation flaws,
users and usability, physical security of the devices and trade-offs); a
taxonomy of D/DoS attacks for IoT; an analysis of the studies that examine the
effects of the attacks, organized by attacks and findings; an examination of
mitigation systems and protocol security solutions with respect to mitigation
mechanisms and targeted attacks; and a categorization of intrusion detection
systems according to detection architecture, detection technique and detection
resources, as well as targeted attacks and implementation environments.

Although we may think that cryptography will be enough, various issues open our
networks to D/DoS attacks. D/DoS attacks are clearly threats not only to the
availability but also to the reliability of the future Internet of Services.
There are various attacks, and the literature contains several studies that aim
to secure IoT networks against them. Considering these efforts, we cannot say
that IoT security is a solved problem. Clearly, there is still a lot to research
and consider.

Although the majority of the studies examined in this work target 6LoWPAN
networks, the security of emerging communication technologies such as LoRaWAN,
NB-IoT, Thread and many others needs the attention of researchers.

Based on our analysis, we can say that a plethora of research exists for
routing layer D/DoS attacks, whereas we cannot find studies targeting the
application layer of IoT. Therefore, the security of the application layer,
considering its attacks and use-cases, needs research. In addition, most of the
studies do not focus on a broad range of attacks, but only on a few. There is a
need for proposals that are capable of targeting more attacks.

Only a few papers consider the use of IoT by malicious entities as an attacking
tool for D/DoS attacks. However, the predicted number of devices in IoT
networks is in the order of billions and IoT applications will be woven into
the fabric of our daily lives. These devices will be very easy for attackers to
target. Therefore, there is a serious need for studies that address this issue.

Nevertheless, while researchers focus on the mentioned issues in future
research, they will face several challenges including resource limitations,
heterogeneity of devices and applications, usability and security awareness,
management and cost.


Abstract. Mobile Internet usage has increased significantly over the last
decade and it is expected to grow to almost 4 billion users by 2020.
Despite the great effort dedicated to improving performance, there
still exist unresolved questions and problems regarding the interaction
between TCP and mobile broadband technologies such as LTE. This
chapter presents a thorough investigation of the behavior of distinct TCP
implementations under various network conditions in different LTE
deployments, including the extent to which TCP is capable of adapting to the
rapid variability of mobile networks under different network loads, with
distinct flow types, during the start-up phase and in mobile scenarios at
different speeds. Loss-based algorithms tend to completely fill the queue,
creating huge standing queues and inducing packet losses in both static and
mobility scenarios. On the other hand, delay-based variants are capable of
limiting the standing queue size and decreasing the number of packets
that are dropped in the eNodeB, but under some circumstances they
are not able to reach the maximum capacity. Similarly, under mobility, in
which the radio conditions are more challenging for TCP, the loss-based
TCP implementations offer better throughput and are able to utilize the
available resources better than the delay-based variants. Finally, under
highly variable circumstances, CUBIC usually enters the congestion avoidance
phase prematurely due to the use of the Hybrid Slow-Start mechanism,
resulting in a slower and longer start-up phase. Therefore, CUBIC is unable to
efficiently utilize radio resources during shorter transmission sessions.


Keywords: TCP adaptability · LTE · Flow size · Slow-Start · Mobility


1 Introduction 


Mobile Internet usage has increased significantly over the last decade, growing
almost 18-fold over the past 5 years, with more than half a million new mobile
devices and connections added in 2016 [1]. The following years are expected to
be equally promising, with 4G traffic reaching a share of more than
three-quarters of the total mobile traffic by 2021. The growth expectation is
not only related to the traffic volume itself but also to the average speed. To
continue this growth and to meet user expectations, all the involved
stakeholders have a common interest in fast downloads, quick responses, high
utilization and few packet losses.

Since a large part of mobile Internet traffic consists of TCP flows, the
performance of TCP over cellular networks has become an important research
topic. Even though many different TCP implementations have been developed in
the last three decades [2], each of them built around a different Congestion
Control Algorithm (CCA), there is still room for improvement in terms of
achieved throughput and resulting delay over highly variable mobile networks.

Previous studies and proposals have reported results regarding the interaction
effects between mobile networks and TCP [3-5] and have tried to define suitable
CCAs for mobile networks [6]. However, none of them has extensively studied the
implications of a wide range of TCP implementations in a variety of static and
moving scenarios. This chapter complements and extends previous work on mobile
networks by studying and evaluating the behavior of a selection of TCP variants
with different packet sizes and network loads, during start-up, and under
mobility at different speeds, i.e., scenarios that are considered challenging
for TCP. In order to appropriately study the different sources capable of
impacting the final performance, the chapter follows a bottom-up approach with
respect to complexity: it starts with static conditions, so as to understand
the responsiveness of TCP under distinct network status and load combinations,
and finishes with a variety of mobility scenarios.

The chapter is organized as follows. Section 2 covers related work. In Sect. 3, a 
brief overview of the studied TCP variants is provided and the LTE testbeds are 
described. Next, in Sect. 4, we explain the methodology regarding the performed 
measurements and the studied scenarios. The findings and results from our work 
are presented in Sect. 5. Finally, Sect. 6 concludes the chapter with a summary
and a discussion of future work. 


2 Related Work 


TCP and LTE cellular access have been studied in depth in recent years. Most of
the studies have researched either the TCP side or the mobile network side.
However, a significant number of researchers have been attracted by the
interaction between TCP and LTE.

One of the fundamental aspects of this interaction is the impact that radio
retransmissions have on delay and how they therefore degrade the achieved
goodput [7,8]. It has been shown that the number of simultaneously active User
Equipments (UEs) attached to a common eNodeB has a huge impact on the effective
available bandwidth, because the radio resources are shared. Thus, the work in
[9] found that sudden increases in background traffic load have a significant
effect on the Round-Trip Time (RTT). This cross-traffic effect severely
influences the network conditions seen by TCP, provoking sudden changes and
making TCP struggle to follow the fluctuations in the available capacity. This
chapter provides a more detailed treatment of the effects of buffering in the
radio access part of LTE by also considering the performance of
high-speed/long-delay variants of TCP in these kinds of networks.

The so-called bufferbloat effect also has a huge impact on the performance of
TCP over LTE [10]. Bufferbloat arises from the configuration of long queues in
both end nodes and intermediate nodes, which can accumulate a large number of
packets without dropping any. However, such excessive packet buffering in a
single queue on the end-to-end network path causes a large latency increase
and, therefore, throughput degradation. Our work does not merely focus on
bufferbloat, but considers the implications of different TCP variants for queue
build-up under certain network conditions.

Considering that many flows on the Internet are short, it is important to
verify how efficiently TCP carries out such transmissions over cellular
networks. It has been demonstrated [11] that under some network conditions TCP
fails to correctly utilize the available capacity and, therefore, the flows
last longer than necessary. The current work complements such works and
analyzes the impact that different flow sizes have on the performance of
different TCP flavors. To this end, our work not only focuses on the stationary
phases of TCP but also on its behavior during start-up, due to its significant
impact on the performance of short flows. In particular, we study the Hybrid
Slow-Start scheme [12] and evaluate how it operates in LTE networks in
comparison with the Standard Slow-Start scheme.

Other studies have measured TCP over live LTE networks. Apart from the classic
metrics of TCP throughput and RTT, the authors of [4] also measured the delay
caused by mobile devices going from the idle to the connected state. In [13],
measurement trials were carried out over the cellular access of four Swedish
operators and the diurnal variation of TCP throughput and delay was analyzed.
The studies in [14,15] performed similar TCP measurements; however, they did
not consider daily variations. None of these live measurements took into
account the impact of speed on the performance of TCP, or the behavior of
different types of CCAs. So, to the best of our knowledge, our work both
complements and extends these works through the study and evaluation of the
behavior of common TCP variants in LTE networks under mobility at different
speeds.

There are only a few works that have considered the impact of different speeds
on the performance of TCP over LTE networks. Even though some works [5] have
studied different speeds, the primary metrics were more related to the radio
part, namely spectral efficiency and the share of resource blocks among the
UEs. Even though the utilization of such radio resources was studied, only one
or two simple variants of TCP were used in a multi-user resource share, which
masks TCP micro-effects. The impact of speed on TCP in LTE was also studied in
[3]. That work focused on uplink and downlink throughput and RTTs, and also
considered time-of-day variations. Still, it did not consider how the CCA
factors into the TCP performance at different velocities. Our work extends the
previous studies with multiple mobility patterns, different speeds and a wide
range of CCAs.


3 Research Environment


In order to compare the behavior of TCP in LTE networks, we first choose the
TCP variants and identify the LTE working parameters. This section first
describes the most important features of the selected CCAs and later presents
the LTE setup.


3.1 TCP Variants 


TCP variants fall into three categories according to the CCA mechanism used:
loss-based, delay-based and combined loss- and delay-based (hybrid). Throughout
this chapter, the analysis starts with five CCAs and, with every measurement
phase, the group is reduced, avoiding the repeated use of TCP solutions that do
not work well in mobile networks. A brief overview of the TCP variants is given
below, together with the classification of the CCAs in Table 1.


Table 1. Selected TCP CCAs and their category 


CCA category                          Selected TCP CCA
Loss-based                            TCP NewReno
                                      TCP CUBIC
Delay-based                           TCP CDG
Hybrid with bandwidth estimation      Westwood+
Hybrid without bandwidth estimation   Illinois


(i) TCP NewReno [16] employs the well-known additive increase multiplicative
decrease (AIMD) mechanism that is common to most CCAs. During the
Slow-Start period, the cwnd increases by one packet per received
acknowledgment (ACK) until it reaches the value of ssthresh. Afterwards, the
cwnd enters the congestion avoidance phase, with an increment of one packet
per RTT (RTT-synchronized growth). If three duplicate ACKs (3DUPACK) are
received or a time-out occurs, the CCA deduces that some link is congested.
After a 3DUPACK, NewReno halves the cwnd (basic back-off) and sets the new
ssthresh to the halved value. However, if a time-out occurs, the cwnd is
reduced to one packet. NewReno is essential in the measurements since it
represents the baseline TCP behavior.

(ii) TCP CUBIC [17] replaces the linear AIMD window growth with a cubic
function. After a decrease of the cwnd, the cwnd ramps up in a concave shape
until it reaches the value it had before the reduction. Afterwards, CUBIC
increases its growth rate and ramps up in a convex shape. CUBIC uses the
Hybrid Slow-Start [12] mechanism in the sender instead of the Standard
Slow-Start phase. Hybrid Slow-Start aims at finding the proper exit point for
Standard Slow-Start in order to avoid massive packet losses. The detection of
such an exit point is based on measurements of ACK trains and RTT delay
samples. TCP CUBIC has been selected for the analysis because of its
widespread use: it is currently the default CCA in Linux servers, which
account for 67% of servers world-wide (as stated by W3Techs [18]).

(iii) TCP CAIA delay gradient (CDG) [19] modifies the TCP sender to use
RTT gradients as a congestion indicator. CDG also estimates the state of
the bottleneck queue so that packet losses are treated as congestion signals
only when the queue is full. Finally, CDG also uses Hybrid Slow-Start, but
with a stricter configuration than CUBIC. TCP CDG has been selected for its
novel use of delay gradients in the AIMD mechanism, in order to evaluate the
actual usefulness of such a feature in mobile networks.

(iv) TCP Westwood+ [20] estimates the available bandwidth and the minimum RTT
(RTTmin) by measuring ACK inter-arrival times. The estimates are used to
decide the new cwnd after a 3DUPACK congestion episode. After a timeout, the
ssthresh is calculated according to the estimates and the cwnd is set to
1 segment. TCP Westwood+ has been selected in this study for its hybrid
behavior, which combines loss-based mechanisms with delay-awareness.

(v) TCP Illinois [21] controls the AIMD mechanism by the estimated queuing
delay and buffer size. In a normal situation, when no queuing delay is
detected, the cwnd is increased by 10 packets per RTT. If the estimated delay
starts increasing, the cwnd increment is gradually lowered until the minimum
value of 0.3 packets per RTT is reached. When the RTT is high compared to the
baseline RTT, a loss is attributed to buffer overflow, whereas with a low RTT
the loss is attributed to packet corruption. Illinois was developed to perform
efficiently in high-speed networks, and its combination of loss-based behavior
and delay-awareness makes it a suitable candidate for our study.
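
To make the window dynamics described above more concrete, the following
minimal sketch expresses three of them as small functions: the NewReno
per-event updates, the CUBIC window curve and the Illinois additive-increase
factor. It is an illustration under stated assumptions, not the kernel code
used in the experiments. The function names are hypothetical, the CUBIC
constants (C = 0.4, beta = 0.7) are the defaults given in RFC 8312 rather than
values taken from this chapter, and the Illinois curve is assumed to have the
form k1 / (k2 + delay), with parameters chosen so that it runs from 10 down to
0.3 packets per RTT as stated in the text.

```python
def newreno_on_ack(cwnd, ssthresh):
    """Per-ACK growth: +1 packet in Slow-Start, about +1 packet per RTT after."""
    if cwnd < ssthresh:
        return cwnd + 1.0          # Slow-Start: one packet per ACK
    return cwnd + 1.0 / cwnd       # congestion avoidance: ~one packet per RTT


def newreno_on_3dupack(cwnd):
    """Basic back-off: halve the window; returns (new_cwnd, new_ssthresh)."""
    half = max(cwnd / 2.0, 2.0)
    return half, half


def newreno_on_timeout(cwnd):
    """Time-out: restart from one packet; returns (new_cwnd, new_ssthresh)."""
    return 1.0, max(cwnd / 2.0, 2.0)


def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC window t seconds after a reduction from w_max (RFC 8312 form).

    The curve is concave until it returns to w_max at t = K and convex after.
    """
    k = (w_max * (1.0 - beta) / c) ** (1.0 / 3.0)
    return c * (t - k) ** 3 + w_max


def illinois_alpha(queue_delay, max_delay, a_min=0.3, a_max=10.0, d1_frac=0.01):
    """Additive increase (packets per RTT) as a function of queuing delay.

    Below a small delay d1 the maximum increase is used; above it the increase
    decays as k1 / (k2 + delay) and reaches a_min at max_delay.
    d1_frac is an assumed tuning value.
    """
    d1 = d1_frac * max_delay
    if queue_delay <= d1:
        return a_max
    k1 = (max_delay - d1) * a_min * a_max / (a_max - a_min)
    k2 = (max_delay - d1) * a_min / (a_max - a_min) - d1
    return k1 / (k2 + queue_delay)


if __name__ == "__main__":
    print("NewReno after 3DUPACK at cwnd=20:", newreno_on_3dupack(20.0))
    print("CUBIC window right after a loss at w_max=100:", cubic_window(0.0, 100.0))
    print("Illinois increase with empty queue:", illinois_alpha(0.0, 0.1))
    print("Illinois increase with full queue:", illinois_alpha(0.1, 0.1))
```

The sketch leaves out Hybrid Slow-Start, fast recovery and the TCP-friendly
region of CUBIC, so it should be read only as a compact summary of the growth
and back-off rules listed above.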


3.2 LTE Setup 


In order to evaluate the performance of TCP over LTE, three different
environments have been used: simulation, emulation and controlled deployment.
Most of the work described in this chapter has been carried out in the
simulated environment and, for comparison purposes, the findings and results
have been correlated with the behavior in the other two deployments. Since the
configuration of the simulated environment comprises many parameters, Table 2
gathers the most important configuration parameters and experiment-related
conditions to help the reader understand the setup.

As the simulated environment, the ns-3 simulator with the LTE capabilities of
the LENA module is used. This module also allows the creation of standard-based
fading traces that can be applied to the channel between the UE and the eNodeB.
Since ns-3 does not use the exact TCP implementations available in the Linux
kernel, we used Direct Code Execution (DCE) Cradle [22] to run real TCP
implementations in ns-3. In order to simulate the distance to the server, the
propagation delay between the fixed remote host and the Packet Data Network
Gateway (PGW) was set to 40 ms. In the Radio Link Control (RLC) layer we
selected the Acknowledged Mode (AM) in order to resemble the most commonly
deployed configuration in the real world. We modified the RLC transmission
queue to support a limit in terms of packets, establishing in our setup a
common packet buffer size in the eNodeB of 750 packets. Regarding the radio
resources, the eNodeB was configured to have a standard value of 100 available
physical resource blocks (PRB). We simulated frequency band 7 (2600 MHz), one
of the most commonly used commercial LTE frequency bands (in Europe).
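
To give a feeling for what a 750-packet eNodeB buffer means in terms of delay,
the following minimal sketch computes how long a full queue takes to drain at a
given radio rate. The buffer size comes from the setup above; the packet size
of 1500 bytes and the example rates are assumptions made for illustration, and
the function name is hypothetical.

```python
def drain_time_s(queue_packets, packet_bytes, rate_bps):
    """Time needed to transmit a full queue at a constant rate, in seconds."""
    return queue_packets * packet_bytes * 8 / rate_bps


QUEUE_PACKETS = 750   # eNodeB buffer size used in the setup
PACKET_BYTES = 1500   # assumed packet size (typical Ethernet MTU)

for rate_mbps in (10, 35, 70):  # hypothetical available radio capacities
    delay = drain_time_s(QUEUE_PACKETS, PACKET_BYTES, rate_mbps * 1e6)
    print(f"{rate_mbps} Mbps: full queue adds about {delay * 1000:.0f} ms")
```

Under these assumptions a full queue holds 9 Mbit, which at 35 Mbps corresponds
to roughly 257 ms of queuing delay, several times the 40 ms one-way delay
configured between the PGW and the server.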

Background flows are used to load the network with multiple short TCP 
connections, similar to the behavior of real networks. The same TCP variant 
is used for both background and foreground traffic in order not to be affected 
by issues of TCP friendliness. The amount of data transferred in a background 
connection as well as the inter-arrival time between two connections were drawn 
from uniform random distributions. 
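
A background load of this kind can be sketched as a schedule of connection
start times and transfer sizes, both drawn from uniform distributions. The
ranges below are hypothetical, since the chapter does not state the bounds that
were used, and the function name is illustrative only.

```python
import random


def background_connections(duration_s, size_range_bytes, gap_range_s, seed=1):
    """Return a list of (start_time_s, size_bytes) for background TCP flows.

    Inter-arrival gaps and transfer sizes are drawn from uniform distributions.
    """
    rng = random.Random(seed)      # fixed seed keeps runs repeatable
    schedule = []
    t = 0.0
    while True:
        t += rng.uniform(*gap_range_s)
        if t >= duration_s:
            break
        size = int(rng.uniform(*size_range_bytes))
        schedule.append((t, size))
    return schedule


# Hypothetical bounds: 10 kB to 1 MB transfers, 0.1 s to 1 s between flows.
flows = background_connections(
    duration_s=20.0,
    size_range_bytes=(10_000, 1_000_000),
    gap_range_s=(0.1, 1.0),
)
print(len(flows), "background connections, first:", flows[0])
```

Using a fixed seed keeps the background load identical across the CCAs under
test, so that differences in the results can be attributed to the CCA rather
than to the random load.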

The controlled testbed aims at providing a measurement platform with the
ability to measure TCP in more realistic radio conditions in order to confirm
or reject the findings and assumptions made in the simulated environment in
relation to the behavior of TCP over LTE. We have used the iMinds’/iMEC’s LTE
facility (LTE w-iLab.t [23]) in Zwijnaarde, Ghent. Apart from provisioning all
the agents involved in LTE, the deployment allows ad-hoc mobility patterns
while experimenting. It is important to underline that in this environment the
LTE transmissions are done over the air, thus allowing a proper study of TCP
and LTE events. Even though the movement is real, the space limitation could
limit the employed speed.


Table 2. Simulation parameters


Simulation environment
Simulator                     ns-3 LENA LTE model
Linux Kernel                  4.3 (DCE)
CCA                           NewReno/CUBIC/Illinois/CDG/Westwood+

Parameter                     Value
One-way delay PGW-Server      40 ms
MAC scheduler                 Proportional fair
AMC model                     MiError
Number of PRBs                100
LTE band                      7 (2600 MHz)
RLC mode                      AM
RLC transmission queue        750 PDUs
Pathloss model                FriisPropagationLossModel
Fading models                 EVA60/EVA200

The emulated testbed targets the validation of the simulated results of TCP
under mobility. To this end, an LTE emulator or LTE-in-a-box (Aeroflex 7100)
has been used. This emulator is capable of creating the LTE radio signal and
all the necessary LTE protocol events to support the attachment and
registration of any LTE device through a radio-frequency cable or over the air.
The tests have been completed with a smartphone, a couple of servers and a
controller that synchronizes the experiments and all the equipment involved
during the assessments. Since the UE in the emulated testbed is not able to
physically move, the controller continuously manipulates the baseline
Signal-to-interference-plus-noise ratio (SINR) levels and the Aeroflex applies
the corresponding fading pattern so as to model actual movement.


4 Methodology Description 


The intrinsic operation mode of LTE (i.e., resource sharing, scheduling, HARQ
mechanisms) results in a constantly changing available capacity. Even in
single-UE scenarios, different positions and fading conditions lead to a
different SINR, so the UE reports a different Channel Quality Indicator (CQI)
to the eNodeB. Thus, the eNodeB assigns a different available capacity to the
channel of the UE through the Modulation and Coding Scheme (MCS) and transport
block size (tbSize). Due to such fluctuations on the radio side, the cwnd
continuously evolves in order to obtain a goodput as close as possible to the
available capacity. The relative progression of both parameters (available
capacity and achieved capacity) plays a fundamental role in the final
performance.
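
The relation between the two parameters can be expressed as a utilization
ratio. The following minimal sketch derives an average available capacity from
a window of transport block sizes and compares it with the measured goodput. It
assumes one transport block per 1 ms LTE transmission time interval (TTI) and
tbSize values given in bytes; these assumptions, the function names and the
sample numbers are illustrative and not taken from the measurement scripts of
this work.

```python
TTI_S = 0.001  # LTE transmission time interval (1 ms)


def capacity_bps(tb_sizes_bytes):
    """Average available capacity over a window of per-TTI transport blocks."""
    if not tb_sizes_bytes:
        return 0.0
    return sum(tb_sizes_bytes) * 8 / (len(tb_sizes_bytes) * TTI_S)


def utilization(goodput_bps, tb_sizes_bytes):
    """Fraction of the available capacity that the flow actually achieved."""
    cap = capacity_bps(tb_sizes_bytes)
    return goodput_bps / cap if cap > 0 else 0.0


# Hypothetical window of three TTIs and a measured goodput of 27 Mbps.
window = [4000, 4500, 5000]
print("capacity (Mbps):", capacity_bps(window) / 1e6)
print("utilization:", utilization(27e6, window))
```

A ratio close to 1 means that the cwnd is tracking the radio capacity well,
whereas a low ratio indicates that the CCA is leaving radio resources unused.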

In this section, the applied methodology will be presented. Figure 1 shows 
the different scenarios that have been used in the analysis of the effects between 
TCP’s different CCAs and LTE. The methodology and reasoning of each scenario 
is explained below. 

(I) Implication of cross-traffic and responsiveness of TCP: The static
scenario aims at providing insights into the evolution and responsiveness of
TCP under different background traffic loads (point I in Fig. 1). There are
three main goals with this scenario: the comparative study of TCP behavior with
and without a loaded cell, the analysis of TCP focusing on short flows, and the
comparison of the responsiveness of TCP variants to a sudden capacity increase
and decrease. Several metrics are gathered at different nodes along the path.
At the source, TCP state information such as cwnd and ssthresh is saved. At the
eNodeB, the transmission buffer length, the drop count and the Packet Data
Convergence Protocol (PDCP) delay (i.e., the time it takes for a PDCP Protocol
Data Unit (PDU) to go from the eNodeB to the UE) are logged. Finally, in the
UE, the goodput is recorded. Since it is measured at the application level,
packet losses and/or reordering may result in goodput spikes.


Fig. 1. Scenarios in use (simulated environment: ns-3 + DCE; controlled
deployment: w-iLab.t; emulated testbed: Aeroflex 7100).


(II) Start-up performance: The aim of this scenario is to analyze the impact
of the Slow-Start phases of different CCAs, such as the above-mentioned
Standard Slow-Start and Hybrid Slow-Start, and to determine their adequacy or
inadequacy in broadband mobile networks. To that end, we deployed 10 static and
scattered UEs (point II in Fig. 1) in good radio conditions (CQI 15) so as to
study the start-up performance in a simplified multi-user scenario and set a
basis for the understanding of the following scenarios. In the server, the
cwnd, RTT, outstanding data and goodput have been collected.

(III & IV) Cell outwards/inwards movement resulting in decreasing/increasing
available capacity: The decreasing quality movement scenario evaluates the
behavior of TCP with a channel quality that constantly worsens on average
(point III in Fig. 1). The idea is to assess the CCA's adaptability in an
environment of continuous capacity reduction (on average) and the impact of the
UE's speed on the final performance. To help simulate different speeds, two
Extended Vehicular A Model (EVA) fading patterns are applied: one for a
velocity of 60 km/h (a common limit on rural roads) and one for 200 km/h (a
common maximum speed of high-speed trains). Apart from the usual metrics in the
evaluation of CCAs, the main metric for simulated mobility-based scenarios is
the relation between the available capacity (extracted from the tbSize) and the
achieved goodput. On the other hand, the increasing quality movement scenario
represents the behavior of TCP on a constantly improving channel quality
(point IV in Fig. 1). These simulations therefore aim at evaluating the CCA's
adaptability under different UE speeds in an environment of continuously
increasing capacity (on average).

(V) Correlation of TCP behavior in deployments as similar as possible to live
commercial LTE networks: This scenario (point V in Fig. 1) aims at providing a
measurement platform with the ability to measure TCP in more realistic radio
conditions in order to confirm or reject the findings and assumptions made in
the simulated environment in relation to the behavior of TCP over LTE. Since
the equipment in the scenario is fully real (see the description in [23]), the
scheduling, queuing and the rest of the features that could have an impact on
delay are realistic and represent more clearly what would happen in live
scenarios, helping in the verification of the findings.

(VI) Emulated support to correlate mobility-based scenarios: Since the previous
scenario is limited in terms of speed, the emulated testbed (point VI in
Fig. 1) is added. Thanks to its use of real UEs and its ability to emulate
movement, it can confirm and clarify performance trade-offs that may remain
blurry in the simulated environment. In order to better understand the
evolution of the different performance-related parameters, the cwnd, RTT,
outstanding data and goodput have been collected in the server.


5 Analysis of the Interactions Observed in Different 
Scenarios 


This section is divided into five main parts: the implications of cross-traffic
and the responsiveness of TCP (scenario I), the start-up performance (scenario
II), the decreasing quality and increasing quality movement scenarios
(scenarios III and IV), the correlation of findings in the controlled
deployment (scenario V) and, finally, the correlation of findings regarding
mobility scenarios over the emulated testbed (scenario VI).


5.1 Cross-Traffic Impact and Responsiveness of TCP 


This subsection covers different kinds of traffic loads and behaviors while the
UE is static. The location of the UE is the same in all measurements and thus
the results are comparable. The subsection is divided into three main
experiments: the comparison between a single UE without cross-traffic and a
loaded network, the impact of short flows and, finally, sudden increases and
decreases of the available capacity.

Fig. 2. Performance comparison of the selected CCAs: (a) Base single flow behavior;
(b) Single flow behavior over loaded network.

Base Behavior and Behavior in a Loaded Network

According to the selected position, the UE has a maximum throughput of around
half of the total maximum (35 Mbps). Different experimental trials are carried
out with and without background traffic to study the responsiveness of TCP and
infer whether the background traffic has the same impact on all the CCAs. To
ease the reading, Table 3 gathers the most important points of the following
explanation.

The three subfigures on the left of Fig. 2 depict the results for the scenario
with no background traffic. The differences between the loss-based TCP variants
and the delay-based ones are remarkable even in such a simplified scenario.
Loss-based implementations manage to achieve the maximum capacity and create a
long standing queue delay (up to 250 ms), whereas delay-based variants, such as
CDG, keep the delay controlled but fail to reach full resource utilization. In
the case of Westwood+, it is clear that the back-off applied after Slow-Start
is very drastic and, because of this, it takes longer to ramp up. Illinois
reduces the cwnd only minimally, causing a huge standing queue delay compared
with more conservative implementations like NewReno. CUBIC suffers from the
deficient behavior of Hybrid Slow-Start. The mechanism exits to the congestion
avoidance phase at an early stage and therefore reduces its growth pace far
from the maximum achievable capacity, severely increasing the time it takes to
converge.

The three subfigures on the right of Fig. 2 show the outcome for the same
scenario but with background traffic. The total target load of the background
traffic is set to 50% of the link capacity. The capacity reduction narrows the
performance gap between loss-based and delay-based variants; still, the more
capacity a CCA gets, the harder the impact it inflicts in terms of queuing
delay (Illinois is an example). Compared with the base example without
background traffic, large differences appear, mostly a significant increase in
the queuing delay and a narrower capacity gap among the CCAs. RTT-clocked CCAs
suffer because the time between their control decisions lengthens. In contrast,
CUBIC behaves better because it does not suffer from the RTT increase. The
scenario itself, due to its reduced available capacity, cushions the
underperformance of Hybrid Slow-Start.

Table 3. Findings wrap-up in base behavior and behavior in a loaded network

CCA       | Conditions     | Behavior
CUBIC     | Base behavior  | Slightly suffers from the deficient behavior of Hybrid Slow-Start
CUBIC     | Loaded network | No impact of Hybrid Slow-Start
NewReno   | Base behavior  | Easily achieves maximum capacity
NewReno   | Loaded network | Similar behavior but with higher delay and more unstable goodput
Illinois  | Base behavior  | Easily achieves maximum capacity. However, it creates a huge standing queue
Illinois  | Loaded network | Very similar to NewReno but with slightly higher delay
CDG       | Base behavior  | Keeps the delay controlled but fails to reach full resource utilization
CDG       | Loaded network | The differences with loss-based CCAs are reduced
Westwood+ | Base behavior  | Very aggressive back-off that impacts the time needed to ramp up
Westwood+ | Loaded network | The impact of the back-off application is minimized


Short Flows Study

Live measurements have shown that many flows over the Internet are small (90%
of downstreams carry no more than 35.9 KB of data [4]). Therefore, it is
important to assess the impact that such a load distribution has on the final
performance. To do so, the previous foreground TCP flow is replaced by a
succession of short flows whose sizes follow an exponential distribution.
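
As an illustration of such a workload, the following minimal sketch draws short-flow sizes from an exponential distribution. The text does not state how the mean flow size was chosen; calibrating the mean so that 90% of the flows stay at or below 35.9 KB is an assumption made here for illustration only, and the function names are hypothetical.

```python
import math
import random


def exponential_mean_for_quantile(x_kb, q):
    """Mean of an exponential distribution whose q-quantile equals x_kb.

    From P(X <= x) = 1 - exp(-x / mean) = q it follows that
    mean = x / -ln(1 - q).
    """
    return x_kb / -math.log(1.0 - q)


def short_flow_sizes(n, mean_kb, seed=0):
    """Draw n flow sizes (in KB) from an exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_kb) for _ in range(n)]


# Assumption: calibrate the mean so that 90% of flows carry <= 35.9 KB.
mean_kb = exponential_mean_for_quantile(35.9, 0.90)
sizes = short_flow_sizes(10000, mean_kb)
share_small = sum(s <= 35.9 for s in sizes) / len(sizes)
print(f"mean flow size: {mean_kb:.2f} KB, share <= 35.9 KB: {share_small:.3f}")
```

Under this assumption the mean flow size is roughly 15.6 KB, which shows how short most transfers are compared with the time a CCA needs to converge.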


Fig. 3. Throughput and queue size ECDF at 700 m: (a) Throughput (Mbps); (b) Queue length (packets).


Figure 3 represents the achieved throughput and the standing queue size as
Empirical Cumulative Distribution Functions (ECDFs). Figure 4 correlates the
size of the flows with the number of induced drops.
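
For readers unfamiliar with the representation, an ECDF can be computed from raw samples as in the sketch below. The queue-length samples are hypothetical values chosen for illustration and are not measurements from the experiments.

```python
def ecdf(samples):
    """Return the sorted samples and, for each one, the fraction of
    samples that are less than or equal to it."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def fraction_at_or_below(samples, threshold):
    """Value of the ECDF at an arbitrary threshold."""
    return sum(s <= threshold for s in samples) / len(samples)


# Hypothetical queue lengths (packets) sampled at the eNodeB.
queue_lengths = [12, 40, 40, 95, 310, 620]
xs, ps = ecdf(queue_lengths)
for x, p in zip(xs, ps):
    print(f"{x:4d} packets -> {p:.2f}")
```

A curve that rises quickly at small queue lengths indicates that the CCA keeps few packets enqueued most of the time.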

In Fig. 3a, it is clear that the achieved throughput is very similar among most
of the CCAs; their differences really appear in the amount of enqueued packets
in Fig. 3b. The delay-based variant, CDG, successfully limits the enqueued
packets, while loss-based implementations overshoot, causing a large standing
queue. Because of the behavior of Westwood+ detected at the beginning of
transmissions and the short duration of the flows, it shows little ability to
inject packets into the eNodeB. In contrast, NewReno, Illinois and CUBIC are
the average solutions. Comparing two deficient solutions such as CUBIC and
Westwood+, we clearly see that even with short flows, and therefore short
transmissions, the premature exit from Slow-Start of the former performs better
than the excessive back-off of the latter.

Fig. 4. eNodeB drops at 700 m

Considering the reported findings, Fig. 4 shows, for each flow size, the number
of packets that exceeded the queue size and were therefore dropped. It is clear
that the more aggressive the CCA is, the more packets are dropped by the
eNodeB. Illinois, for instance, behaves more aggressively in the congestion
avoidance phase: it enqueues more packets and gets more packets dropped. As a
result, Illinois suffers on average 100 more dropped packets than any other TCP
candidate. Once again, the behavior of Hybrid Slow-Start is clearly shown.
Setting aside the fact that transmissions with Hybrid Slow-Start take slightly
more time to complete, it only suffers drops with longer transmissions and,
when congestion events occur, the number of losses is very small. With
loss-based AIMD mechanisms, the number of dropped packets appears to be
directly related to the aggressiveness and the back-off strategy. NewReno and
Westwood+ have quite similar results (Table 4).

Table 4. Findings wrap-up in short flows study

CCA       | Behavior
CUBIC     | Thanks to the underperformance of Hybrid Slow-Start, it only suffers drops with longer transmissions and, when it has congestion events, the number of losses is very small
NewReno   | The average solution
Illinois  | Very aggressive behavior that results in 100 more dropped packets on average
CDG       | Successfully limits the enqueued packets
Westwood+ | Due to the aggressive back-off at the beginning of the transmissions and the short duration of the flows, it shows little ability to inject packets


Sudden Increase and Decrease of the Available Capacity 

Once the main features of the CCAs have been detected in loaded scenarios in
comparison with the base behavior, as well as the impact of different short
flows on the drop rate, it is important to study the responsiveness of CCAs to
big and sudden capacity changes. To this end, two types of simulations are
carried out: one with the background traffic being stopped at 20 s of the test
and one with the background traffic being started at 20 s of the test. To ease
the reading, Table 5 gathers the most important points of the following
explanation.

On the one hand, the left part of Fig. 5 shows the results for the scenario
with a sudden capacity increase. In general, as soon as the capacity increases,
the queue size drops because previously enqueued packets are released. It is
clear that loss-based CCAs quickly respond to an additional bandwidth
assignment. However, Westwood+ still suffers from the excessive reduction of
the cwnd after the Slow-Start phase. During the congestion avoidance phase, its
AIMD mechanism is very conservative and the number of enqueued packets tends to
be almost 0; therefore, when a new and greater capacity becomes achievable, the
adaptation ability of the CCA is very weak. Delay-based variants, since they
mainly focus on reducing the delay over the path, usually fail to increase
their pace and thus the newly available capacity is wasted.

On the other hand, the right part of Fig. 5 depicts the case in which the
background traffic is activated at 20 s. Due to the sudden reduction of
available capacity, the queue size increases instantly: the number of packets
arriving at the eNodeB stays the same while the number of outgoing packets
drops drastically. Figure 5 clearly shows that all CCAs but CDG are able to
react successfully to the capacity reduction. However, in some cases, such as
CUBIC, the CCA takes more time to stabilize at the new pace.
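
The instant queue growth can be reproduced with a very simple bottleneck model, sketched below. It is an open-loop illustration with hypothetical rates: the sender is assumed to keep its arrival rate constant, whereas a real CCA would eventually react to the added delay or to losses.

```python
def simulate_queue(arrivals_per_ms, service_before, service_after,
                   change_ms, duration_ms, queue_limit):
    """Drop-tail queue with a step change of the service rate.

    All rates are in packets per millisecond. Returns the queue length
    after every millisecond and the total number of dropped packets.
    """
    queue = 0
    dropped = 0
    history = []
    for t in range(duration_ms):
        service = service_before if t < change_ms else service_after
        queue += arrivals_per_ms          # packets arriving at the eNodeB
        queue -= min(queue, service)      # packets sent over the radio link
        if queue > queue_limit:           # drop-tail overflow
            dropped += queue - queue_limit
            queue = queue_limit
        history.append(queue)
    return history, dropped


# Hypothetical values: the service rate halves at t = 20 ms.
history, dropped = simulate_queue(arrivals_per_ms=4, service_before=4,
                                  service_after=2, change_ms=20,
                                  duration_ms=40, queue_limit=30)
print(history[18:24], "dropped:", dropped)
```

In this toy setting the queue is empty until the capacity halves, then grows by two packets per millisecond until the buffer limit is reached and packets start being dropped.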

These simulations reflect that most CCAs, even delay-based implementa- 
tions, are capable of reducing their throughput when sudden available capacity 
decreases happen but delay-based variants struggle to adapt their pace to band- 
width increases. 

Fig. 5. Performance comparison of the selected CCAs: (a) Sudden capacity increase;
(b) Sudden capacity decrease.

Table 5. Findings wrap-up in sudden increase and decrease of the available capacity

CCA       | Conditions          | Behavior
CUBIC     | Inc. available cap. | Good performance without the impact of Hybrid Slow-Start due to the low available capacity at the beginning of the transmission
CUBIC     | Dec. available cap. | Impact of Hybrid Slow-Start in the beginning. Aggressive behavior in congestion avoidance phase that leads to an instant huge increment of queue size while reducing the available capacity
NewReno   | Inc. available cap. | Good responsiveness and average delay impact
NewReno   | Dec. available cap. | Average loss-based solution that suffers an instant standing queue increase while reducing the available capacity
Illinois  | Inc. available cap. | Good responsiveness and greater induced delay than NewReno
Illinois  | Dec. available cap. | Its aggressiveness is harmful in this scenario and it takes some time to stabilize the goodput
CDG       | Inc. available cap. | Fails to increase its pace and thus the new available capacity is wasted
CDG       | Dec. available cap. | Bad performance in terms of goodput but full control of the delay, which is always close to the baseline delay
Westwood+ | Inc. available cap. | Its AIMD mechanism is very conservative and the enqueued packets tend to be very few, so it is not capable of responding to a sudden greater capacity assignment
Westwood+ | Dec. available cap. | The combination of its dynamics (with a slow ramp-up ability) and the available capacity reduction happens to give the best performance, achieving the maximum goodput with the lowest impact in terms of delay


5.2 Start-Up Performance 


In very simplified scenarios, we have seen that the behavior of Standard
Slow-Start and Hybrid Slow-Start differs, leading on some occasions to a
successful avoidance of massive losses with Hybrid Slow-Start. However, LTE
cells are usually more crowded, and the cross-traffic of other UEs could add
delay that affects Hybrid Slow-Start. The goal is to assess whether the
internal mechanisms of Hybrid Slow-Start could provoke an early exit from the
standard ramp-up, followed by a slow increase of the cwnd and therefore a
significant underutilization of radio resources.

We first measured the convergence behavior of Standard Slow-Start, recording
the packets that are in flight at every moment. Later, we assessed the same for
Hybrid Slow-Start. Figure 6 shows the probability density function (PDF) of the
number of injected packets for both mechanisms during the time Standard
Slow-Start takes to converge. That is, we compare the ability of both methods
to put packets in flight over the fastest convergence period.

Fig. 6. Hybrid Slow-Start impact in mobile networks: injected packets during a
standard Slow-Start period.

Figure 6 shows that Standard Slow-Start has an even distribution of packets in
flight, whereas Hybrid Slow-Start has an imbalanced distribution, presumably
formed by the period in which Hybrid Slow-Start ramps up like Standard
Slow-Start and the period after it detects a delay variation and grows at the
incremental pace of the congestion avoidance phase. The distribution reflects
the huge difference between both methods in their ability to inject packets,
which can be extrapolated to the time needed to converge or to achieve the
maximum capacity from the beginning of the transmission.

It is clear that, not only in simplified scenarios but also in multi-UE
measurements, Hybrid Slow-Start suffers from the detection of a delay increase
and the early trigger of the exit condition from fast ramp-up. So, under some
delay variability circumstances, Hybrid Slow-Start slows down the ramp-up of
TCP. In some situations, this effect could leave the available radio resources
underutilized and lengthen the time needed to converge, directly impacting the
quality experienced by users (QoE).
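
The effect of an early exit on the number of injected packets can be sketched with a toy round-based model. This is a simplification under stated assumptions, not the actual Hybrid Slow-Start implementation: the exit threshold is a hypothetical constant, losses are ignored, and the growth after the exit is modelled as one packet per round rather than CUBIC's real window function.

```python
def packets_injected(rtt_samples_ms, base_rtt_ms,
                     exit_threshold_ms=None, initial_cwnd=10):
    """Cumulative packets sent over the given rounds (one RTT sample per round).

    exit_threshold_ms=None models Standard Slow-Start, which never leaves
    the exponential phase because of delay. Otherwise the sender leaves it
    as soon as an RTT sample exceeds the baseline by more than the threshold.
    """
    cwnd = initial_cwnd
    in_slow_start = True
    total = 0
    for rtt in rtt_samples_ms:
        total += cwnd  # one window of packets is injected per round
        if (in_slow_start and exit_threshold_ms is not None
                and rtt - base_rtt_ms > exit_threshold_ms):
            in_slow_start = False  # delay increase detected: early exit
        cwnd = cwnd * 2 if in_slow_start else cwnd + 1
    return total


# Hypothetical RTT samples with a delay spike in the third round.
rtts = [40, 41, 52, 45, 44, 43]
standard = packets_injected(rtts, base_rtt_ms=40)
hybrid = packets_injected(rtts, base_rtt_ms=40, exit_threshold_ms=8)
print("standard:", standard, "hybrid:", hybrid)
```

With these illustrative numbers a single delay spike in the third round is enough to cut the injected packets to less than a third over six rounds, which is the kind of gap that Fig. 6 reflects.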


5.3 Mobility Performance 


This subsection covers the analysis of decreasing quality and increasing
quality movement for the selected CCAs. Even though it has been shown in
Subsect. 5.1 that some CCAs fail in mobile networks (Westwood+ and CDG), they
have been kept for comparison and confirmation purposes. In order to measure
the ability or inability of distinct TCP implementations to take advantage of
radio resources, the results are presented as the portion of the tbSize that
has actually been utilized in every Transmission Time Interval (TTI). In other
words, since the TTI is commonly configured to 1 ms, the portion of tbSize
shows how many bits are used for the UE every millisecond. Considering that
different MCS values lead to distinct available capacities, and therefore to a
different achievable throughput, the analysis is divided into MCS ranges.
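
The metric described above can be sketched as follows. The record layout, the MCS ranges and the transport block sizes are hypothetical and serve only to show how the achieved/available ratio is aggregated per MCS range.

```python
from collections import defaultdict


def utilization_by_mcs_range(tti_records, mcs_ranges):
    """Achieved/available capacity per MCS range.

    tti_records: iterable of (mcs, tb_size_bits, used_bits), one per TTI.
    mcs_ranges: list of inclusive (low, high) MCS intervals.
    """
    available = defaultdict(int)
    used = defaultdict(int)
    for mcs, tb_size_bits, used_bits in tti_records:
        for low, high in mcs_ranges:
            if low <= mcs <= high:
                available[(low, high)] += tb_size_bits
                # A TTI cannot carry more bits than its transport block.
                used[(low, high)] += min(used_bits, tb_size_bits)
                break
    return {r: used[r] / available[r] for r in mcs_ranges if available[r] > 0}


# Hypothetical per-TTI records (not real LTE transport block sizes).
records = [(28, 1000, 400), (27, 1000, 600),
           (20, 800, 800), (19, 800, 600),
           (15, 500, 100)]
ranges = [(26, 28), (18, 25), (14, 17)]
ratios = utilization_by_mcs_range(records, ranges)
for mcs_range, ratio in ratios.items():
    print(mcs_range, f"{ratio:.3f}")
```

A ratio close to 1 means that TCP keeps enough data enqueued to fill the transport blocks assigned to the UE in that MCS range.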


Decreasing Quality Movement 

The decreasing quality movement scenario represents the continuous movement of
a certain UE from the eNodeB to a more distant location. In other words, on
average the obtained SINR will tend to get worse due to the distance from the
UE to the eNodeB and the fading. So will the reported CQI and the assigned MCS
(rather than getting worse, the MCS tends towards a more robust modulation). In
such a transition, the CCA needs to adapt to the different available
capacities. Figure 7 shows the relation between the available capacity and the
achieved capacity for different CCAs under distinct speeds, all classified by
average MCS.

Fig. 7. Achieved/Available capacity at different speeds for different TCP variants
(decreasing quality movement).


Figure 7 clearly depicts three main areas: 


Slow-Start phase: Located in the coverage zone associated with MCS 28. During
the transmission establishment and first ramp-up, the cwnd is not large enough
to take full advantage of the available radio resources. Considering that
Standard Slow-Start converges very fast, the MCS 28 area also includes the
first back-off application. For that reason, Westwood+ or NewReno, among
others, do not report the same result. Since the distance associated with an
MCS is covered a lot faster at 200 km/h, the cwnd has no time to grow quickly
enough and therefore the impact of the ramp-up is more significant for the
scenario at 200 km/h, prompting a lower achieved/available value for this
speed.


“Bufferbloat” area: In the area between MCS 26 and MCS 18-20, the CCAs are able
to take advantage of the packets already enqueued in the eNodeB (bufferbloat
effect). However, the effect itself has a drawback in terms of the inflicted
delay. This feature is more present in the examples at 60 km/h. The time spent
in each MCS area allows the TCP variant to inject packets over a longer time,
suffering packet losses and having to recover from them under high-delay
conditions, which prevents the CCA from achieving the maximum capacity.


Queue draining zone: Regardless of the speed, this is an area in which the
radio conditions are not good enough to maintain a full utilization of
resources. Even though the average MCS values are between 18 and 14, fading
conditions force the eNodeB to operate with very low MCS values (sometimes
reaching MCS 4 and 6) in some drastic fades. With each sudden fade, more robust
modulations are assigned, so the packets need stronger segmentation. As a side
effect, both the queue size of the eNodeB and the delay increase. The recovery
of losses in such network conditions is also a harmful process for TCP that
leads to queue starvation events. In the faster UE scenario, the eNodeB is able
to keep using previously enqueued packets up to more distant positions;
therefore, the draining effect is slower or at least happens at more distant
positions.

Figure 7 shows that in decreasing quality movement the differences among
loss-based CCAs are minimal: aggressiveness pays off more at 60 km/h and
RTT-synchronization at 200 km/h (NewReno and Illinois over CUBIC). Once again,
and even in a scenario that moves towards worse radio positions, the
delay-based variants have proven unable to cope with the delay variability of
LTE. CDG maintains an RTT close to the baseline RTT but underutilizes most of
the assigned bandwidth. In the case of Westwood+, even though the continuous
reduction of achievable capacity in this scenario helps weak AIMD mechanisms
reach the maximum capacity, it takes a very long time to do so at 60 km/h and
it is not capable of doing so at 200 km/h.


Increasing Quality Movement 

Having analyzed the decreasing quality movement and the behavior of different
CCAs under distinct speeds, it is necessary to study the increasing quality
movement, in which the channel quality constantly evolves towards better
positions. Considering the findings in decreasing quality movement, it is
important to determine whether the different Slow-Start methods struggle
equally under challenging radio conditions and to analyze whether the
aggressiveness of TCP injects sufficient packets to serve a continuously
growing capacity.

To better explain the effects of this scenario at the beginning of the
transmission, Fig. 8 represents the relation between the cwnd evolution and the
achieved goodput. For clarity, the graphs have been split into two blocks: the
cwnd evolution is on the left and the cumulative sum of the goodput is on the
right. Figure 8 depicts the difference in behavior between CUBIC with Hybrid
Slow-Start and NewReno with Standard Slow-Start. It is clear that the network
conditions are challenging because, even with Standard Slow-Start, the shape of
the cwnd is very stepped. In such conditions, in which the delay variability is
also a hard drawback to tackle, the Hybrid Slow-Start mechanism detects a delay
increase that is considered sufficient to trigger an early exit to the
congestion avoidance phase. The resulting cumulative goodput of both CCAs is
represented on the right, where the graphs show a big gap between both methods.
Once again, the underperformance of Hybrid Slow-Start is shown. Besides, in
this case the early exit from fast ramp-up is provoked in a single-UE scenario
in which movement and fading are the only sources of delay variation.

Fig. 8. NewReno vs. CUBIC in increasing quality scenarios at 200 km/h.


Considering the explained effect of Hybrid Slow-Start on performance, we now
study the performance differences under different speeds between NewReno,
CUBIC, Westwood+, Illinois and CDG, classified by average MCS levels (see
Fig. 9). At first glance, the figure looks very similar to Fig. 7, but some
differences are present. The behavior in this scenario is divided into two
areas.


Ramp-up phase: The hardest radio conditions for the channel are present from
MCS 14 to 18. In such challenging conditions, the CCAs initialize the
transmission and employ the selected Slow-Start method in an attempt to ramp up
and converge as fast as possible without inducing a bursty loss event. As seen
beforehand, at 200 km/h the performances of Standard Slow-Start and Hybrid
Slow-Start are completely different, leading to a better utilization of the
network resources in the case of Standard Slow-Start. Besides, MCS 14 on some
occasions includes not only the Slow-Start phase but also part of the first
back-off and the application of the congestion avoidance phase. The growth
limitation is comparatively very similar for 60 km/h and 200 km/h during this
phase and establishes an unavoidable boundary for loss recovery. However, in
faster scenarios less time is spent in the weakest radio conditions and the
impact of such challenging conditions is less significant in the final outcome.
Apart from that, in the case of Standard Slow-Start, at 200 km/h the first loss
event happens in better radio conditions than at 60 km/h and therefore the
ability to recover the lost packets is greater at 200 km/h.

Fig. 9. Achieved/Available capacity at different speeds for different TCP variants
(increasing quality movement).


Stationary area: From MCS 20 to 28, TCP is able to take close to full advantage
of the available capacity. However, it has to be mentioned that, due to the
transition speed and the back-offs applied while recovering from losses, the
CCAs are not able to raise the cwnd sufficiently, causing some channel
underutilization.

Even though, in general, the CCAs follow the identified phases, there are 
some differences among the CCAs that result in a distinct outcome for the same 
network conditions (see the wrap-up Table 6). 

Different Slow-Start methods affect the ability to take full advantage of radio
resources, dividing the performance into two major groups. (1) Among the CCAs
with the same Slow-Start phase, some differences appear in MCS 14 due to the
different AIMD policy applied when a loss is detected. As stated before, the
higher the speed, the comparatively longer the Slow-Start phase and therefore
the smaller the loss recovery effect, because the recovery takes place in
better radio conditions. (2) For the Hybrid Slow-Start mechanism, a difference
between CUBIC and CDG appears regarding the delay sensitivity to quit fast
ramp-up (as stated in Subsect. 3.1). In relation to the effect of speed, the
faster the UE moves, the higher the delay variability and therefore the quicker
the switch to a slow increase phase, so more bandwidth is wasted at 200 km/h.

Table 6. Findings wrap-up in mobility scenarios

CCA       | Behavior
CUBIC     | It only suffers the impact of Hybrid Slow-Start in the very beginning of the transmission in increasing quality movement pattern under high speed (see Fig. 8)
NewReno   | Very good performance in terms of achieved available capacity in simplified single-user mobility scenarios
Illinois  | The best results due to the combination of the delay-awareness and the aggressiveness
CDG       | It has demonstrated very weak performance over cellular access under mobility in terms of bandwidth utilization
Westwood+ | Only able to reach full utilization after a long ramp-up period in decreasing quality movement at 60 km/h and in increasing quality movement at 200 km/h

Westwood+ and CDG have proven not to be adequate for mobile networks. The
former is able to reach full utilization only after a long ramp-up period in
decreasing quality movement at 60 km/h and in increasing quality movement at
200 km/h. The time spent is due to a poor available bandwidth estimation and
the consequent drastic back-off policy. At 200 km/h in the decreasing quality
scenario, the CCA does not have sufficient time to achieve the maximum
bandwidth. On the contrary, in increasing quality movement, the fastest
scenario allows the CCA to reach the maximum capacity. The latter has
demonstrated very weak performance over cellular access. It has to be
underlined that the main objective of the CCA, the control of end-to-end delay,
is fulfilled; however, regardless of the speed and scenario, the CCA has not
been able to reach 10% of the available capacity, consequently leading to
around 90% resource underutilization. Therefore, CDG is not suitable for mobile
networks as it is configured now.

The group formed by NewReno, CUBIC and Illinois has shown a very successful
performance regardless of the speed and movement pattern. As stated in the
previous explanation, the shortening of the challenging periods could make a
difference in terms of greater achieved capacity. On average (see average
values on the right), at 60 km/h the three CCAs are very similar, and it is
only at 200 km/h that CUBIC performs poorly due to Hybrid Slow-Start and
Illinois gains a slight advantage from its delay-awareness to make the most of
the available resources.

All the gathered results are consistent with the findings regarding decreasing
quality movement (in Subsubsect. 5.3), the performance of different Slow-Start
methods (in Subsect. 5.2) and the preliminary analysis of the impact of
different cross-traffic on the performance of CCAs. However, since the results
have been obtained in a single LTE deployment, it is important to determine to
what extent our findings can be extrapolated as general-purpose behavior of
CCAs and whether the results are biased towards the simulated/emulated testbed.


5.4 Correlation of TCP Behavior over w-iLab.t LTE Testbed 


This subsection represents and explains the behavior of a selection of CCAs
over the controlled LTE testbed called w-iLab.t. Since the deployment is formed
with completely real equipment (i.e. UEs, eNodeBs, femtocells, servers), the
internal mechanisms of LTE and the interaction with TCP are closer to
real-world behavior and therefore the variability is presumably higher than in
the simulated environment. Thus, such a testbed allows carrying out experiments
that represent real-world performance on a smaller scale. CDG was removed from
the comparison set because of its incompatibility with mobile networks.
Westwood+ is kept in the selection of CCAs to confirm or deny its
underperformance under more variable circumstances.

We configured three different paths to be followed by the robots with
decreasing quality and increasing quality movements. Those movement patterns
were located in different places of the femtocell: one close to the eNodeB,
another close to the spatial limits of the testbed and a third in the middle of
the previous two. After ten experiments over the different
configurations/patterns, we gathered the following average throughput values
for CUBIC, NewReno, Westwood+ and Illinois.

Fig. 10. CCA comparison over w-iLab.t under mobility circumstances.

Figure 10 shows that the previous findings in ns-3 were accurate enough to
explain the possible effect of CCAs in other LTE deployments. In fact, some
deficiencies, such as the ones regarding Hybrid Slow-Start and Westwood+, are
more harmful than in the simulation environment, causing a greater gap between
the available capacity and the achieved one.

In general, three features are the most important to underline. First, the
drastic back-off application of Westwood+ makes the CCA incapable of achieving
the maximum capacity even within 20 s of transmission. Looking at the growth
tendency, the CCA may well take around 1 min to converge, which is an
unacceptable value for providing a good service to the UEs. Second, the
underperformance of Hybrid Slow-Start is more remarkable in this testbed and
the results point to a convergence time of around 4.5 s. The performance
difference with Standard Slow-Start (present in NewReno and Illinois) could be
cushioned if the transmission is long enough (the average value of CUBIC is
close to that of NewReno or Illinois in a 20 s transmission). However, the
impact on short-lived flows would be more notable. Third, the performances of
NewReno and Illinois are very similar and the only distinction appears due to
the greater aggressiveness of Illinois, which stems from its delay-awareness.
Nevertheless, the utilized femtocells give a very good channel quality
regardless of the mobility pattern or speed. Thus, the “signal quality rings”
that are present in the real world could not be represented. Therefore, in
order to better understand the performance tradeoff of CUBIC, NewReno and
Illinois in the congestion avoidance phase under mobility, an additional
analysis was required.


5.5 Performance Tradeoff of Selected TCP Variants Under Mobility 
in Emulated Testbed 


Once the previous findings regarding the behavior of CCAs have been
demonstrated in a controlled testbed, this subsection covers the comparative
analysis of the most adequate TCP flavors (CUBIC, NewReno and Illinois) over
the emulated testbed with mobile scenarios of decreasing quality and increasing
quality movement. The previous scenarios have shown that CUBIC, NewReno and
Illinois have a very close outcome. Therefore, this subsection serves not only
to confirm the findings in another testbed but also to carry out experiments
under mobility with a realistic representation of “signal quality rings”.

The testbed itself is not able to emulate movement due to the fixed position
of the UE, which is attached to a radio cable. Nonetheless, a computer that
plays the role of experiment controller is capable of establishing the baseline
SINR at any moment. In addition, the LTE-in-a-box equipment, an Aeroflex 7100,
applies a fading pattern to this variable baseline SINR, thereby modelling the
effect of movement with a static UE.

To help decide the best timing for the different baseline SINR values, averaged
SINR traces obtained from ns-3 with a UE moving in decreasing quality and
increasing quality movement patterns at 60 km/h are used. In order to give more
realism to the experiments, the EVA60 fading model of the Aeroflex 7100 is
applied. We have decided to use only the scenarios at 60 km/h because the
results of the CCAs were nearly equal in ns-3 at that speed, whereas at
200 km/h the differences among CCAs were noticeable. Therefore, these
experiments add information to the previous inconclusive outcomes and give more
insight into the differences among the selected CCAs. Figure 11 depicts the
average goodput, end-to-end delay and duplicated ACK (DUPACK) events per second
as a sign of congestion, for decreasing quality movement at the top and for
increasing quality movement at the bottom.
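
As an illustration of how averaged SINR traces could be turned into a timed
sequence of baseline SINR values for the experiment controller, the following
Python sketch averages several traces sample by sample and collapses them into
fixed-length steps. The function name, the step length, the sampling period and
the example values are assumptions made for illustration; they are not taken
from the testbed tooling, and averaging directly in dB is a simplification.

```python
from statistics import mean


def build_sinr_schedule(traces, step_s, sample_period_s):
    """Average several SINR traces and collapse them into fixed-length steps.

    traces: equally long lists of SINR samples in dB (hypothetical input).
    Returns a list of (start_time_s, baseline_sinr_db) tuples.
    """
    if not traces or any(len(t) != len(traces[0]) for t in traces):
        raise ValueError("traces must be non-empty and equally long")
    # Average across runs, sample by sample.
    averaged = [mean(samples) for samples in zip(*traces)]
    samples_per_step = max(1, round(step_s / sample_period_s))
    schedule = []
    for i in range(0, len(averaged), samples_per_step):
        chunk = averaged[i:i + samples_per_step]
        # One baseline value per step, rounded to 0.1 dB.
        schedule.append((i * sample_period_s, round(mean(chunk), 1)))
    return schedule


# Made-up traces of a decreasing quality movement (values are placeholders).
example_traces = [[20.0, 18.0, 16.0, 14.0], [22.0, 18.0, 14.0, 12.0]]
example_schedule = build_sinr_schedule(example_traces, step_s=2.0, sample_period_s=1.0)
```

Each tuple of the resulting schedule would tell the controller when to switch
to the next baseline SINR, while the fading model is applied on top of it by
the emulator.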


Fig. 11. Performance tradeoff of CCAs in the emulated testbed at 60 km/h: (a)
Decreasing quality movement; (b) Increasing quality movement.

In this case, the differences among the CCAs are noticeable for both movement
patterns. The goodput results do not reveal any new feature and rank the
selected TCP implementations from best to worst as Illinois, CUBIC and NewReno.
This outcome applies equally to decreasing quality and increasing quality
movement, with a slightly larger difference in increasing quality movement due
to the continuous capacity increase and the ability to cushion overshoots. The
simplest way to proceed would be to say that Illinois is the best among the
CCAs. However, depending on the performance objective, the decision could be a
different one. The reasons are manifold. First, even though the goodput of
Illinois is better, the induced delay and the consequent packet losses are far
larger than in the cases of CUBIC and NewReno. Second, if we compare the
overall performance of CUBIC and NewReno, we see that although the delay and
DUPACK events are very similar, CUBIC achieves the higher goodput. Therefore,
when trying to avoid massive packet losses and added delay, CUBIC would be the
more desirable choice in this simple comparison. Third, for comparison
purposes, since the objective of this scenario was to understand the congestion
avoidance phase and the adaptability to mobile LTE scenarios, the Hybrid
Slow-Start mechanism was disabled. Taking this detail into account, and
depending on the requirements of the application, NewReno cannot be discarded.
To conclude this tradeoff study, it is clear that Illinois, CUBIC and NewReno
have very similar results, but it cannot easily be decided whether one is
better than the others because each of them has its “bright side” and its
drawback.
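
The dependence of the choice on the performance objective can be sketched as a
weighted ranking over goodput, delay and DUPACK events. The Python example
below is only a minimal sketch: the scoring scheme (min-max normalization plus
weights), the weights and the metric values are assumptions introduced for
illustration. The numbers are placeholders chosen to mimic the qualitative
ordering described above and are not measurements from the testbed.

```python
def rank_ccas(metrics, weights):
    """Rank CCAs by a weighted score over min-max normalized metrics.

    Higher goodput is better; lower delay and fewer DUPACK events are better.
    """
    higher_is_better = {"goodput": True, "delay": False, "dupack": False}
    scores = {cca: 0.0 for cca in metrics}
    for name, weight in weights.items():
        values = [m[name] for m in metrics.values()]
        lo, hi = min(values), max(values)
        span = hi - lo
        for cca, m in metrics.items():
            # Map each metric to [0, 1]; identical values get a neutral 0.5.
            norm = 0.5 if span == 0 else (m[name] - lo) / span
            if not higher_is_better[name]:
                norm = 1.0 - norm
            scores[cca] += weight * norm
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder values (NOT measured): Illinois has the highest goodput but the
# largest delay and DUPACK count; CUBIC and NewReno differ only in goodput.
PLACEHOLDER_METRICS = {
    "Illinois": {"goodput": 10.0, "delay": 100.0, "dupack": 10.0},
    "CUBIC": {"goodput": 9.0, "delay": 60.0, "dupack": 4.0},
    "NewReno": {"goodput": 8.0, "delay": 60.0, "dupack": 4.0},
}

throughput_first = rank_ccas(
    PLACEHOLDER_METRICS, {"goodput": 1.0, "delay": 0.0, "dupack": 0.0})
delay_sensitive = rank_ccas(
    PLACEHOLDER_METRICS, {"goodput": 0.4, "delay": 0.3, "dupack": 0.3})
```

With these placeholder inputs, a goodput-only objective ranks Illinois first,
whereas an objective that also penalizes delay and DUPACK events moves CUBIC to
the top, which mirrors the argument made in the text.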


6 Conclusion 


This chapter has tried to shed some light on the adaptability of CCAs to
different mobile network situations, including the implications of different
types of cross-traffic, the start-up phase and mobile UEs with increasing
quality and decreasing quality movement patterns. The chapter has also included
different LTE deployments so as to confirm and clarify the results obtained in
the simulated environment. Table 7 summarizes the findings and confirmations of
CCA behavior under distinct circumstances.

Simple static experimentation with different background traffic profiles and
behaviors has demonstrated that loss-based TCP mechanisms reach the maximum
capacity more quickly than delay-based variants. The former achieve greater
throughput but fail to limit the standing queue size and therefore inflict
severe delays. The latter are able to keep the end-to-end delay close to the
baseline delay value but struggle to ramp up or speed up their injection pace,
thereby wasting a great amount of the available radio resources.

Table 7. Findings and confirmations of CCA behavior under distinct circumstances

| CCA | Simulated env. | Controlled testbed | Emulated testbed |
|---|---|---|---|
| CUBIC | Suffers from its delay sensitivity in the Hybrid Slow-Start phase, which is very harmful and causes underutilization of the mobile network capabilities | Confirmed behavior of Hybrid Slow-Start with even greater impact. Long transmissions would suffer this effect, but it would be more significant in short-lived ones | In simplified mobility scenarios, the cubic congestion avoidance phase allows good utilization of the available capacity while the delay is lower than with Illinois (the closest CCA in terms of goodput) |
| NewReno | Has responded very positively to different network situations, showing that it is still a good TCP candidate in certain situations. Its speed weaknesses in fixed networks could turn into a valuable feature in mobile networks | Confirmation of the good performance | Some specific mobility circumstances have shown deficient performance of NewReno, leading to resource underutilization, and may well indicate which mobile network circumstances are not suitable for the protocol |
| Illinois | Very similar to the performance of NewReno, with a bigger impact on delay due to its greater aggressiveness. This aggressiveness allows it to perform slightly better in scenarios that require rapid adaptability (under mobility) | Overall performance of Illinois has been demonstrated, showing better performance than NewReno in terms of achieved throughput in close-to-reality scenarios | Under mobility circumstances, a slightly wider gap between the outcomes of Illinois and NewReno has been found. The results may indicate that under more realistic conditions the gap will be even greater |
| CDG | Has demonstrated very weak performance in all scenarios over mobile networks in terms of bandwidth utilization. However, it has shown good control of the delay, keeping it close to the baseline delay | - | - |
| Westwood+ | Found a problem with a drastic back-off application that can cause underutilization of the radio resources under certain network situations | Confirmation of the findings, with an even greater impact of the deficiency. The closer to the real world, the poorer the assessment of the available capacity and, therefore, the more deficient the application of the back-off policy | - |

Different scenarios have shown the huge impact that the Hybrid Slow-Start
mechanism has under some delay variability circumstances. Bearing in mind that
delay instability is one of the main features of mobile networks, Hybrid
Slow-Start can slow down the start-up phase, leading to poor resource
utilization. Taking into account the widespread usage of CUBIC due to its
presence by default in most Web servers, the problem is even worse. Even though
long transmissions suffer the impact of the underperformance of Hybrid
Slow-Start, the effect is greater in the case of short-lived flows.
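
The sensitivity of Hybrid Slow-Start to delay variability can be made concrete
with a small sketch of a delay-increase exit rule. The Python code below
assumes a rule in which slow start is left once the RTT of the current round
exceeds the minimum RTT by a clamped fraction of that minimum. The constants
(a divisor of 8 and bounds of 4 ms and 16 ms) and the function names are
assumptions for illustration and may differ from the implementation evaluated
in this chapter.

```python
def hystart_delay_threshold_ms(min_rtt_ms, low_ms=4.0, high_ms=16.0, divisor=8.0):
    """RTT increase tolerated before slow start is left.

    The threshold is a fraction of the minimum RTT, clamped to [low_ms, high_ms].
    """
    return max(low_ms, min(min_rtt_ms / divisor, high_ms))


def exits_slow_start(min_rtt_ms, round_rtt_ms):
    """True when the RTT of the current round exceeds min RTT plus the threshold."""
    return round_rtt_ms >= min_rtt_ms + hystart_delay_threshold_ms(min_rtt_ms)


# Hypothetical example: with a 40 ms minimum RTT the tolerated increase is 5 ms,
# so a 7 ms delay variation is enough to end slow start, while 3 ms is not.
threshold = hystart_delay_threshold_ms(40.0)
early_exit = exits_slow_start(40.0, 47.0)
no_exit = exits_slow_start(40.0, 43.0)
```

Under such a rule, a delay variation of only a few milliseconds that is
unrelated to self-inflicted queuing would be enough to terminate the
exponential growth early, which is consistent with the slow start-up observed
for CUBIC.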

Regarding the mobility scenarios, two have been studied. In decreasing quality
movement, most CCAs are able to achieve the maximum capacity during good radio
conditions, and they prolong the use of previously enqueued packets while
moving towards worse channel qualities. At higher speeds, the already enqueued
packets are carried to more distant positions than at lower speeds, improving
the average capacity utilization. In increasing quality movement, regardless of
the speed, the transmission initialization and the first ramp-up happen in very
challenging radio conditions, which demands from the CCAs the ability to scale
and to recover from losses, as well as AIMD mechanisms suited to making the
most of the available capacity.

In relation to the specific features of each CCA's adaptability, several
findings have to be mentioned: (1) CUBIC suffers from its delay sensitivity in
the Hybrid Slow-Start phase, which is very harmful and leads to underutilization
of the mobile network capabilities. (2) CDG keeps the delay close to the
baseline delay value but is incapable of increasing its pace in order to
utilize greater capacities. In its current state it is not suitable for mobile
networks, and it could be more suitable for wired networks where the delay
variation is not that abrupt. However, the delay boundaries of the protocol may
well be adapted to the constraints of cellular networks. (3) Westwood+ has been
shown to be incapable of properly estimating the available bandwidth, leading
to big cwnd reductions and the need to grow from very low values at the very
weak incremental pace of AIMD. The estimation needs to be adapted in order to
make it suitable for mobile networks. (4) NewReno and Illinois have been shown
to beat the other CCAs (apart from CUBIC in some situations) under different
loads, traffic patterns, mobility and speed contexts. In the simulated
environment, the only detected difference appeared in the increasing quality
scenario, in which its delay-awareness and greater aggressiveness gave Illinois
the best performance regarding the use of available capacity; in the emulated
testbed, the differences were also present in decreasing quality movement.
Since the emulated testbed has shown a slightly wider gap between the outcomes
of Illinois and NewReno, the results may indicate that under more realistic
conditions the gap will be even greater.
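
The effect described in finding (3) can be illustrated with simple arithmetic.
The Python sketch below assumes the commonly described Westwood+ rule that sets
the slow-start threshold after a loss to the estimated bandwidth multiplied by
the minimum RTT, and an additive increase of one segment per RTT afterwards.
The function names, the segment size and the example figures are hypothetical
and are not results from this chapter.

```python
def westwood_ssthresh(bwe_bps, rtt_min_s, mss_bytes=1448):
    """Slow-start threshold in segments after a loss: BWE x RTTmin, floor of 2."""
    return max(2, int(bwe_bps * rtt_min_s / (8 * mss_bytes)))


def rtts_to_reach(target_segments, start_segments):
    """Round trips needed with an additive increase of one segment per RTT."""
    return max(0, target_segments - start_segments)


# Hypothetical link: 40 Mbit/s available capacity and 50 ms minimum RTT.
ideal_window = westwood_ssthresh(40e6, 0.05)     # window that fills the link
reduced_window = westwood_ssthresh(10e6, 0.05)   # bandwidth under-estimated as 10 Mbit/s
recovery_rtts = rtts_to_reach(ideal_window, reduced_window)
recovery_time_s = recovery_rtts * 0.05           # lower bound, ignores RTT growth
```

In this hypothetical case, an estimate of one quarter of the real capacity
forces more than a hundred round trips of additive increase before the link is
filled again, and repeated back-offs of this kind would explain convergence
times far longer than those of the other CCAs.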

The feature-based findings have been confirmed over the LTE deployment of
w-iLab.t, and the performance tradeoff of the best CCAs has been explained
under mobility circumstances in order to give insights regarding the
appropriate selection depending on the application requirements. This chapter
has given an overview of the behavior of the different TCP mechanisms in an LTE
network under different circumstances. This work might be of value as a
validation of the performance of different CCAs and as an indication of
fruitful directions for the improvement of TCP congestion control over cellular
networks.

Some knowledge from the network would help TCP decide the best strategy in
accordance with the network conditions. The envisioned scenario is aligned with
the main features of mobile edge computing (MEC) management, which would allow
removing as much end-to-end TCP variant dependency as possible. In the same
way, other initiatives such as QUIC [24], which propose transport services in
the user space of the operating system with TCP-like CCAs on top of UDP (UDP as
a substrate), could take advantage of this comprehensive analysis in order to
select the most appropriate TCP candidate (i.e. depending on multiple criteria
that consider both network state and application requirements) for each network
condition and enable such a TCP-like implementation.